Use Document Classification

The following sections describe how to create and train a classifier, and query the classifier with new documents.

For more information about the classification actions, refer to the IDOL Server Reference.

Choose Feature Fields

Before you create a classifier, you must choose the fields in your documents that you want to use to classify documents. These are the feature fields for the classifier.

Feature fields generally contain short pieces of information, such as a name or a very brief description. A good choice of feature field is similar to a good choice of ParametricType field. For example, if you want to create a food classifier, you might use a field that stores ingredients, or a meal name, rather than a field that contains a recipe procedure or a detailed description of a type of food.

The feature fields must contain information that describes features of the different classes that you want to create for your classifier. For example, to classify meals as vegetarian or meat-based, you must find feature fields that describe features of vegetarian or meat-based meals.

The exact choice of feature field also depends on the contents of your documents.

For example, the following IDX document describes part of a recipe for soup:

#DREREFERENCE Food/Carrot and Coriander Soup
#DRETITLE Carrot and Coriander Soup
#DRESECTION 0
#DREFIELD Ingredient="carrots"
#DREFIELD Ingredient="onion"
#DREFIELD Ingredient="potato"
#DREFIELD Herbs="coriander"
#DREFIELD Seasoning="vegetable stock"
#DREFIELD Meal="soup"
#DREFIELD Equipment="food processor"
#DREFIELD PreparationTime="20 minutes"
#DREFIELD CookingTime="1 hour"
#DREFIELD Description="This easy recipe makes a tasty carrot and coriander soup"
#DRECONTENT
Example soup recipe
#DREENDDOC

You can choose more than one feature field for a classifier. The classifier does not distinguish between data from different feature fields. It extracts the content from all the available feature fields from a document, and uses all the content to train the classifier (or classify a document).

For example, suppose that your document has the following fields:

#DREFIELD Ingredient1="carrots"
#DREFIELD Ingredient2="onion"
#DREFIELD Ingredient3="potato"

You can set Ingredient1, Ingredient2, and Ingredient3 as feature fields. If you use this document for classification, it gives the same results as if you used a document with the following fields:

#DREFIELD Ingredient1="onion"
#DREFIELD Ingredient2="potato"
#DREFIELD Ingredient3="carrots"
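
The pooling behavior described above can be illustrated with a small Python sketch. This is not how HPE IDOL Server extracts features internally (the source does not describe that); the pooled_features helper is a hypothetical illustration that collects the values of all feature fields into one unordered bag, so that the field a value came from no longer matters.

```python
from collections import Counter

def pooled_features(idx_text, feature_fields):
    """Collect the values of all feature fields into one unordered bag.

    Hypothetical helper for illustration only; it ignores which
    field each value came from, as the text above describes.
    """
    bag = Counter()
    prefix = "#DREFIELD "
    for line in idx_text.splitlines():
        if not line.startswith(prefix):
            continue
        name, _, value = line[len(prefix):].partition("=")
        if name.strip() in feature_fields:
            bag[value.strip().strip('"')] += 1
    return bag

doc_a = "\n".join([
    '#DREFIELD Ingredient1="carrots"',
    '#DREFIELD Ingredient2="onion"',
    '#DREFIELD Ingredient3="potato"',
])
doc_b = "\n".join([
    '#DREFIELD Ingredient1="onion"',
    '#DREFIELD Ingredient2="potato"',
    '#DREFIELD Ingredient3="carrots"',
])
fields = {"Ingredient1", "Ingredient2", "Ingredient3"}

# Both documents pool to the same bag of values.
same = pooled_features(doc_a, fields) == pooled_features(doc_b, fields)
print(same)
```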

Create a Classifier

You create a classifier with a unique name and a set of feature fields.

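To create a classifier

Send the ClassifierCreate action, with the ClassifierName parameter set to the name of the new classifier, and the FeatureFields parameter set to a comma-separated list of the feature fields.

For example: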

action=ClassifierCreate&ClassifierName=food&FeatureFields=Ingredient,Herbs,Seasoning

This action creates a food classifier, which uses the Ingredient, Herbs, and Seasoning fields to classify documents.
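
If you send actions from a script, you can build the same request programmatically. The following Python sketch only constructs the URL and does not contact a server. The host and port in BASE_URL are assumptions for illustration, and classifier_create_url is a hypothetical helper name; the action and parameter names come from the example above.

```python
from urllib.parse import urlencode

# Assumption: replace with the host and ACI port of your own server.
BASE_URL = "http://localhost:9000/"

def classifier_create_url(name, feature_fields, base_url=BASE_URL):
    """Build a ClassifierCreate request URL (hypothetical helper)."""
    params = {
        "action": "ClassifierCreate",
        "ClassifierName": name,
        # Feature fields are sent as one comma-separated value.
        "FeatureFields": ",".join(feature_fields),
    }
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return base_url + "?" + urlencode(params, safe=",")

url = classifier_create_url("food", ["Ingredient", "Herbs", "Seasoning"])
print(url)
```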

Create and Train Classes

After you create the classifier, you create the classes and assign training to them. You can either create a class and assign its training in a single action, or create the class first and train it later.

The documents that you use to train the class must exist in the HPE IDOL Server data index. You provide training in the form of a state token, which you create by using the Query action with the StoreState parameter set to True. See Choose Training Documents for Classes.

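To create a class

Send the ClassifierAddClass action, with the ClassifierName parameter set to the name of the classifier, and the ClassName parameter set to the name of the new class. To train the class at the same time, set the StateID parameter to the state token that contains the training documents.

For example: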

action=ClassifierAddClass&ClassifierName=food&ClassName=vegetarian&StateID=B8UGIK95FKJG-23

This action creates a vegetarian class in the food classifier. It assigns the documents from the state token B8UGIK95FKJG-23 as training for the new class.

If you do not train the class when you create it, you can add training by using the ClassifierSetClassTraining action. You can also use this action to retrain a class. For more information, see Retrain a Class.

You must run the ClassifierAddClass action for each class that you want to create in the classifier.
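
Because you need one ClassifierAddClass action for each class, it can be convenient to generate the query strings in a loop. The following Python sketch assumes that StateID can be omitted when you plan to train the class later, as described above. The add_class_query helper and the meat-based class name are hypothetical; the state token is the one from the earlier example.

```python
from urllib.parse import urlencode

def add_class_query(classifier, class_name, state_id=None):
    """Build the query string for ClassifierAddClass (hypothetical helper)."""
    params = {
        "action": "ClassifierAddClass",
        "ClassifierName": classifier,
        "ClassName": class_name,
    }
    # Only send StateID when training is assigned at creation time.
    if state_id is not None:
        params["StateID"] = state_id
    return urlencode(params)

# Class name -> state token, or None to create the class untrained.
classes = {
    "vegetarian": "B8UGIK95FKJG-23",
    "meat-based": None,  # hypothetical class, to be trained later
}

queries = [add_class_query("food", name, token) for name, token in classes.items()]
for query in queries:
    print(query)
```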

Choose Training Documents for Classes

When you create a classifier, you must train each of the classes with content that represents the classes that you want to define. The content must exist in your HPE IDOL Server data index, and the content must contain the feature fields that you have defined for the classifier.

You provide training to the classes as a state token. You create state tokens by sending the Query action with the StoreState parameter set to True. Therefore, to train a class, you must have a single query that returns the documents that define that class.

For some classifications, you might be able to perform a complex query that returns enough documents to train your classifier. However, the best way to find training is usually to manually categorize a set of documents, and add a field that labels the document with its class. You can then use a simple FieldText query to find all documents with a particular label.

For example, if you label a set of documents with a MealType field, with a value of savory or dessert, you can use the following query to find and save the results to use as training for the savory class:

action=Query&FieldText=MATCH{savory}:MealType&MaxResults=1000&StoreState=True

You can use the resulting state token that this query returns to train the class. You can also create similar queries to train your other classes.
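
As a sketch of how you might generate one such query for each label, the following Python code builds the query strings for the savory and dessert labels. It only constructs the strings; sending them and reading the state token from the response is left out, because the response format is not covered here. The training_query helper is a hypothetical name. The braces and colon in the FieldText value are percent-encoded, which decodes to the same expression as in the example above.

```python
from urllib.parse import urlencode, parse_qs

def training_query(label_field, label_value, max_results=1000):
    """Build a Query that stores its results as a state token."""
    params = {
        "action": "Query",
        # Produces, for example, MATCH{savory}:MealType
        "FieldText": f"MATCH{{{label_value}}}:{label_field}",
        "MaxResults": max_results,
        "StoreState": "True",
    }
    return urlencode(params)

# One query for each class label in the MealType field.
training_queries = {
    label: training_query("MealType", label)
    for label in ("savory", "dessert")
}
for label, query in training_queries.items():
    print(label, query)
```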

After you have trained the classifier, you can classify any new documents, and automatically add the label field to those documents.

NOTE:

To get the best results out of your classifiers, use as many training documents as possible. HPE recommends that you use a minimum of 200 to 300 training documents for each class.

Train the Classifier

You must train the classifier before you can use it to classify documents. During this stage, HPE IDOL Server retrieves all the training documents from the index, and extracts the feature fields. It uses the content to train each class in the classifier.

For HPE IDOL Server to successfully train the classifier, it must have at least two classes, each of which must have training assigned.
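
If you track your classes in a script, you can check this requirement before you send the training action. The following Python sketch is a hypothetical client-side check only; it assumes that you keep your own record of which state token you assigned to each class, and it does not replace the validation that the server performs. The second state token shown is a made-up placeholder.

```python
def can_train(class_training):
    """Return True if there are at least two classes and each has training.

    class_training maps a class name to its state token, or to None
    when no training has been assigned yet (hypothetical bookkeeping).
    """
    has_enough_classes = len(class_training) >= 2
    all_trained = all(token for token in class_training.values())
    return has_enough_classes and all_trained

# The second class has no training yet, so training would not succeed.
ready = can_train({"vegetarian": "B8UGIK95FKJG-23", "meat-based": None})
print(ready)

# Placeholder token "EXAMPLE-TOKEN-2" is invented for illustration.
ready_after = can_train({"vegetarian": "B8UGIK95FKJG-23", "meat-based": "EXAMPLE-TOKEN-2"})
print(ready_after)
```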

NOTE:

When HPE IDOL Server trains the classifier, it ignores any very rare features.

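To train a classifier

Send the ClassifierTrain action, with the ClassifierName parameter set to the name of the classifier that you want to train.

For example: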

action=ClassifierTrain&ClassifierName=food

This action trains the food classifier.

NOTE:

The action returns an error if HPE IDOL Server could not extract any features from the training documents (for example, because none of the training documents contained the feature fields for the classifier).

Classify Documents

You can use a trained classifier to classify documents, by using the ClassifierQuery action.

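The document can either be a document that exists in the index, or a document that does not exist in the index and that you supply in the action.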

In both cases, HPE IDOL Server extracts the classifier feature fields from the query document, and compares the values in these feature fields against the trained classes in the classifier. The action returns the class that the document matches most closely.

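To classify a document that exists in the index

Send the ClassifierQuery action, with the ClassifierName parameter set to the name of the classifier, and the DocRef parameter set to the reference of the document.

For example: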

action=ClassifierQuery&ClassifierName=food&DocRef=http://www.example.com/documents/carrots

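To classify a document that does not exist in the index

Send the ClassifierQuery action, with the ClassifierName parameter set to the name of the classifier, and the QueryText parameter set to the URL-encoded text of the document.

For example: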

action=ClassifierQuery&ClassifierName=food&QueryText=%23DREREFERENCE%20Food%2FLeek%20and%20Potato%20Pie%0D%0A%23DRETITLE%20Leek%20and%20Potato%20Pie%0D%0A%23DRESECTION%200%0D%0A%23DREFIELD%20Ingredient%3D%22leeks%22%0D%0A%23DREFIELD%20Ingredient%3D%22potatoes%22%0D%0A%23DREFIELD%20Ingredient%3D%22cheese%22%0D%0A%23DREFIELD%20Ingredient%3D%22pastry%22%0D%0A%23DREFIELD%20Ingredient%3D%22butter%22%0D%0A%23DREFIELD%20Ingredient%3D%22egg%22%0D%0A%23DREFIELD%20Herbs%3D%22rosemary%22%0D%0A%23DREFIELD%20Herbs%3D%22thyme%22%0D%0A%23DREFIELD%20Meal%3D%22pie%22%0D%0A%23DREFIELD%20Equipment%3D%22pie%20dish%22%0D%0A%23DREFIELD%20PreparationTime%3D%2210%20minutes%22%0D%0A%23DREFIELD%20CookingTime%3D%221%20hour%22%0D%0A%23DREFIELD%20Description%3D%22This%20easy%20recipe%20makes%20a%20tasty%20leek%20and%20potato%20pie%22%0D%0A%23DRECONTENT%0D%0APie%20recipe%0D%0A%23DREENDDOC

This action classifies the following document:

#DREREFERENCE Food/Leek and Potato Pie
#DRETITLE Leek and Potato Pie
#DRESECTION 0
#DREFIELD Ingredient="leeks"
#DREFIELD Ingredient="potatoes"
#DREFIELD Ingredient="cheese"
#DREFIELD Ingredient="pastry"
#DREFIELD Ingredient="butter"
#DREFIELD Ingredient="egg"
#DREFIELD Herbs="rosemary"
#DREFIELD Herbs="thyme"
#DREFIELD Meal="pie"
#DREFIELD Equipment="pie dish"
#DREFIELD PreparationTime="10 minutes"
#DREFIELD CookingTime="1 hour"
#DREFIELD Description="This easy recipe makes a tasty leek and potato pie"
#DRECONTENT
Pie recipe
#DREENDDOC
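
To produce a QueryText value like the one in the action above, you percent-encode the document text. The following Python sketch shows one way to do this for a shortened version of the document. It assumes that lines are separated by a carriage return and line feed, which matches the %0D%0A sequences in the example action; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# A shortened version of the pie document, for illustration.
idx_lines = [
    "#DREREFERENCE Food/Leek and Potato Pie",
    "#DRETITLE Leek and Potato Pie",
    "#DRESECTION 0",
    '#DREFIELD Ingredient="leeks"',
    '#DREFIELD Herbs="thyme"',
    "#DRECONTENT",
    "Pie recipe",
    "#DREENDDOC",
]

# Join with CRLF, which encodes to %0D%0A as in the example action.
idx_text = "\r\n".join(idx_lines)

# safe="" also encodes "/" so that the whole document is one parameter value.
query_text = quote(idx_text, safe="")

action = "action=ClassifierQuery&ClassifierName=food&QueryText=" + query_text
print(action)
```

Decoding the value with unquote returns the original document text, which is a quick way to check an encoded action before you send it.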
