Introduction to Document Classification

Document classification allows you to automatically assign documents to classes according to the values that occur in a set of fields that you specify. Classification uses the random forest algorithm, and you can use it as an alternative to IDOL conceptual categorization (see Categorization).

Classification works by analyzing the contents of the feature fields in the documents. You choose feature fields that contain useful information for classifying the documents. Typically, feature fields contain small snippets of information, rather than large portions of text. For example, you might use a name or color field, rather than the document content. The feature fields that you use depend on the classifier and the classes that you want to create.
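
As a rough illustration of what "feature fields" means, the following Python sketch reduces a document to its feature-field values. The document layout, the field names NAME, COLOR, and CONTENT, and the extract_features helper are all hypothetical; this is not an IDOL data format or API.

```python
# Hypothetical document: the field names are made up for illustration.
document = {
    "NAME": "Garden chair",
    "COLOR": "green",
    "CONTENT": "A long free-text product description ...",
}

# Short, informative fields make good features; the large text field is left out.
FEATURE_FIELDS = ["NAME", "COLOR"]


def extract_features(doc, feature_fields):
    """Return only the feature-field values, skipping fields the document lacks."""
    return {field: doc[field] for field in feature_fields if field in doc}


features = extract_features(document, FEATURE_FIELDS)
print(features)
```

Only the NAME and COLOR values are passed on; the CONTENT field plays no part in the features.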

To use classification, you create one or more classifiers. A classifier contains a set of classes (similar to categories), which represent the topics that you want to assign documents to. You train each class with a set of documents that represent the kind of content that you want the class to match.

After you create and train a classifier, you can query the classifier with new documents, and the classifier returns the details of the class that each new document belongs to.
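
The workflow described above (choose feature fields, train each class with example documents, then query with a new document) can be sketched with a toy random forest in plain Python. This is a simplified illustration under stated assumptions, not IDOL's implementation or API: the field names COLOR and SHAPE, the class names, the training data, and the functions train_forest and classify are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical training documents: (feature-field values, class name).
TRAINING = [
    ({"COLOR": "red", "SHAPE": "round"}, "apple"),
    ({"COLOR": "green", "SHAPE": "round"}, "apple"),
    ({"COLOR": "red", "SHAPE": "round"}, "apple"),
    ({"COLOR": "yellow", "SHAPE": "long"}, "banana"),
    ({"COLOR": "yellow", "SHAPE": "long"}, "banana"),
    ({"COLOR": "green", "SHAPE": "long"}, "banana"),
]


def gini(rows):
    """Gini impurity of the class labels in rows (0.0 means a pure group)."""
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())


def build_tree(rows, fields, rng, fields_per_split=1):
    """Build one decision tree over categorical feature fields."""
    labels = Counter(label for _, label in rows)
    majority = labels.most_common(1)[0][0]
    if len(labels) == 1 or not fields:
        return majority  # leaf: a class name
    # Random forests consider only a random subset of fields at each split.
    candidates = rng.sample(fields, min(fields_per_split, len(fields)))
    best = None
    for field in candidates:
        groups = {}
        for row in rows:
            groups.setdefault(row[0].get(field), []).append(row)
        score = sum(len(g) / len(rows) * gini(g) for g in groups.values())
        if best is None or score < best[0]:
            best = (score, field, groups)
    _, field, groups = best
    remaining = [f for f in fields if f != field]
    children = {
        value: build_tree(group, remaining, rng, fields_per_split)
        for value, group in groups.items()
    }
    return (field, children, majority)  # internal node


def predict_tree(tree, features):
    """Walk one tree; fall back to the node's majority class on unseen values."""
    while isinstance(tree, tuple):
        field, children, majority = tree
        value = features.get(field)
        if value not in children:
            return majority
        tree = children[value]
    return tree


def train_forest(rows, fields, n_trees=51, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the training rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        forest.append(build_tree(sample, fields, rng))
    return forest


def classify(forest, features):
    """Return the class that receives the most votes across all trees."""
    votes = Counter(predict_tree(tree, features) for tree in forest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    forest = train_forest(TRAINING, ["COLOR", "SHAPE"])
    print(classify(forest, {"COLOR": "red", "SHAPE": "round"}))
```

With this toy data, a red, round document should be assigned to the apple class. Individual trees can disagree, for example when a bootstrap sample happens to contain no red documents, which is why the forest takes a majority vote across many trees.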