The following sections describe how to create and train a classifier, and query the classifier with new documents.
For more information about the classification actions, refer to the IDOL Server Reference.
Before you create a classifier, you must choose the fields in your documents that you want to use to classify documents. These are the feature fields for the classifier.
Feature fields generally contain short pieces of information, such as a name or a very brief description. A good choice of feature field is similar to a good choice of ParametricType field. For example, if you want to create a food classifier, you might use a field that stores ingredients, or a meal name, rather than a field that contains a recipe procedure or a detailed description of a type of food.
The feature fields must contain information that describes features of the different classes that you want to create for your classifier. For example, to classify meals as vegetarian or meat-based, you must find feature fields that describe features of vegetarian or meat-based meals.
The exact choice of feature field also depends on the contents of your documents.
For example, the following IDX document describes part of a recipe for soup:
#DREREFERENCE Food/Carrot and Coriander Soup
#DRETITLE Carrot and Coriander Soup
#DRESECTION 0
#DREFIELD Ingredient="carrots"
#DREFIELD Ingredient="onion"
#DREFIELD Ingredient="potato"
#DREFIELD Herbs="coriander"
#DREFIELD Seasoning="vegetable stock"
#DREFIELD Meal="soup"
#DREFIELD Equipment="food processor"
#DREFIELD PreparationTime="20 minutes"
#DREFIELD CookingTime="1 hour"
#DREFIELD Description="This easy recipe makes a tasty carrot and coriander soup"
#DRECONTENT
Example soup recipe
#DREENDDOC
You can choose more than one feature field for a classifier. The classifier does not distinguish between data from different feature fields. It extracts the content from all the available feature fields from a document, and uses all the content to train the classifier (or classify a document).
For example, suppose that your document has the fields:
#DREFIELD Ingredient1="carrots"
#DREFIELD Ingredient2="onion"
#DREFIELD Ingredient3="potato"
You can set Ingredient1, Ingredient2, and Ingredient3 as feature fields. If you use this document for classification, it gives the same results as if you used a document with the following fields:
#DREFIELD Ingredient1="onion"
#DREFIELD Ingredient2="potato"
#DREFIELD Ingredient3="carrots"
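The behavior described above can be illustrated with a short Python sketch. It is only a model of the documented behavior, not how the IDOL Category component is implemented: the `feature_values` helper and the regular expression are hypothetical, and the sketch assumes that field values are simply pooled without their field names.

```python
import re
from collections import Counter

# Matches IDX field lines of the form: #DREFIELD Name="value"
FIELD_PATTERN = re.compile(r'#DREFIELD\s+(\w+)="([^"]*)"')

def feature_values(idx_text, feature_fields):
    """Pool the values of the chosen feature fields, discarding the field names."""
    wanted = set(feature_fields)
    return Counter(
        value
        for name, value in FIELD_PATTERN.findall(idx_text)
        if name in wanted
    )

doc_a = ('#DREFIELD Ingredient1="carrots"\n'
         '#DREFIELD Ingredient2="onion"\n'
         '#DREFIELD Ingredient3="potato"\n')
doc_b = ('#DREFIELD Ingredient1="onion"\n'
         '#DREFIELD Ingredient2="potato"\n'
         '#DREFIELD Ingredient3="carrots"\n')

fields = ["Ingredient1", "Ingredient2", "Ingredient3"]
# Both documents contribute the same pooled values, so they look identical
# to a classifier that does not distinguish between feature fields.
same_features = feature_values(doc_a, fields) == feature_values(doc_b, fields)
```

Under these assumptions, `same_features` is `True`, because only the pooled values matter and not the field each value came from.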
You create a classifier with a unique name and a set of feature fields.
To create a classifier
Send the ClassifierCreate action to the IDOL Category component, with the following parameters:
ClassifierName set to the name of the new classifier. This name must be unique in the IDOL Category component.
FeatureFields set to a comma-separated list of the feature fields that you want to use for the classifier.
For example, you can create a food classifier that uses a set of fields, including Seasoning, to classify documents.
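As a minimal sketch, the ClassifierCreate action could be assembled in Python as follows. The host, port, and the `classifier_create_url` helper are assumptions for illustration, and the list of feature fields is an illustrative choice rather than the one from the original example. Only the action and parameter names come from this topic.

```python
from urllib.parse import urlencode

# Hypothetical address of the IDOL Category component; replace with your own.
CATEGORY_URL = "http://localhost:9020"

def classifier_create_url(name, feature_fields, base_url=CATEGORY_URL):
    """Build a ClassifierCreate action URL for the given classifier name."""
    params = {
        "action": "ClassifierCreate",
        "ClassifierName": name,
        # FeatureFields takes a comma-separated list of field names.
        "FeatureFields": ",".join(feature_fields),
    }
    # safe="," keeps the list separators readable instead of encoding them as %2C.
    return f"{base_url}/{urlencode(params, safe=',')}"

url = classifier_create_url("food", ["Ingredient", "Seasoning"])
print(url)
```

You could paste the resulting URL into a browser or send it with any HTTP client.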
After you create the classifier, you create and assign training to the classes. You can either create the classes and assign training in a single action, or you can create the classes and train them later.
The documents that you use to train the class must exist in the IDOL data index (IDOL Content component). You provide training in the form of a state token, which you create by using the Query action with the StoreState parameter set to True. See Choose Training Documents for Classes.
To create a class
Send the ClassifierAddClass action to the IDOL Category component, with the following parameters:
ClassifierName set to the name of the classifier.
ClassName set to the name of the new class.
StateID set to a state token that lists the documents that you want to use to train the class.
For example, an action with ClassifierName set to food, ClassName set to vegetarian, and StateID set to B8UGIK95FKJG-23 creates a vegetarian class in the food classifier. It assigns the documents from the state token B8UGIK95FKJG-23 as training for the new class.
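The following Python sketch shows one way to assemble the ClassifierAddClass action for the vegetarian class described above. The host, port, and the `classifier_add_class_url` helper are hypothetical. The state token is optional in the sketch because you can create a class first and train it later.

```python
from urllib.parse import urlencode

def classifier_add_class_url(classifier, class_name, state_id=None,
                             base_url="http://localhost:9020"):
    """Build a ClassifierAddClass action URL (base_url is a placeholder)."""
    params = {
        "action": "ClassifierAddClass",
        "ClassifierName": classifier,
        "ClassName": class_name,
    }
    # Only include StateID when training is assigned at creation time.
    if state_id is not None:
        params["StateID"] = state_id
    return f"{base_url}/{urlencode(params)}"

with_training = classifier_add_class_url("food", "vegetarian", "B8UGIK95FKJG-23")
without_training = classifier_add_class_url("food", "vegetarian")
print(with_training)
```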
If you do not train the class when you create it, you can add training by using the ClassifierSetClassTraining action. You can also use this action to retrain a class. For more information, see Retrain a Class.
You must run the ClassifierAddClass action for each class that you want to create in the classifier.
When you create a classifier, you must train each of the classes with content that represents the classes that you want to define. The content must exist in your IDOL data index, and the content must contain the feature fields that you have defined for the classifier.
You provide training to the classes as a state token. You create state tokens by sending the Query action with the StoreState parameter set to True. Therefore, to train a class, you must have a single query that returns the documents that define that class.
For some classifications, you might be able to perform a complex query that returns enough documents to train your classifier. However, the best way to find training is usually to manually categorize a set of documents, and add a field that labels the document with its class. You can then use a simple FieldText query to find all documents with a particular label.
For example, if you label a set of documents with a MealType field, with a value of savory or dessert, you can use a FieldText query on the MealType field, with StoreState set to True, to find the documents with one label and save the results to use as training for the corresponding class.
You can use the state token that this query returns to train the class. You can also create similar queries to train your other classes.
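A labeling query of this kind might be built as in the Python sketch below. Several details are assumptions rather than part of this topic: the host and port of the IDOL Content component, the `training_query_url` helper, the use of `Text=*` to match all documents, and the `MATCH{value}:Field` form of the FieldText expression. Check them against the IDOL Server Reference before relying on them.

```python
from urllib.parse import urlencode

def training_query_url(label_field, label_value,
                       base_url="http://localhost:9100"):
    """Build a Query action URL that stores its results as a state token."""
    params = {
        "action": "Query",
        "Text": "*",  # assumption: match all documents, then filter by FieldText
        # Restrict results to documents whose label field has the given value.
        "FieldText": f"MATCH{{{label_value}}}:{label_field}",
        "StoreState": "True",
    }
    # urlencode percent-encodes the braces and colon in the FieldText value.
    return f"{base_url}/{urlencode(params)}"

savory_query = training_query_url("MealType", "savory")
dessert_query = training_query_url("MealType", "dessert")
print(savory_query)
```

Each query returns its own state token, which you can then pass as StateID when you create or retrain the matching class.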
After you have trained the classifier, you can classify any new documents, and automatically add the label field to those documents.
To get the best results out of your classifiers, use as many training documents as possible. HPE recommends that you use a minimum of 200 to 300 training documents for each class.
You must train the classifier before you can use it to classify documents. During this stage, the IDOL Category component retrieves all the training documents from the index, and extracts the feature fields. It uses the content to train each class in the classifier.
For Category to successfully train the classifier, it must have at least two classes, each of which must have training assigned.
When Category trains the classifier, it ignores any very rare features.
To train a classifier
Send the ClassifierTrain action to the IDOL Category component, with the ClassifierName parameter set to the name of the classifier.
This action trains the classifier that you name in the ClassifierName parameter.
The action returns an error if the IDOL Category component could not extract any features from the training documents (for example, because none of the training documents contained the feature fields for the classifier).
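The training preconditions can be checked on the client side before you send the action. The Python sketch below assumes that you track your classes in a dictionary that maps each class name to its state token. That bookkeeping, the `classifier_train_url` helper, the host and port, and the second state token are all hypothetical.

```python
from urllib.parse import urlencode

def classifier_train_url(classifier, class_training,
                         base_url="http://localhost:9020"):
    """Build a ClassifierTrain URL after checking the documented preconditions.

    class_training maps each class name to its state token, or to None
    when the class has no training assigned yet.
    """
    if len(class_training) < 2:
        raise ValueError("a classifier needs at least two classes")
    untrained = sorted(name for name, token in class_training.items() if not token)
    if untrained:
        raise ValueError(f"classes without training: {', '.join(untrained)}")
    params = {"action": "ClassifierTrain", "ClassifierName": classifier}
    return f"{base_url}/{urlencode(params)}"

classes = {
    "vegetarian": "B8UGIK95FKJG-23",
    "meat-based": "EXAMPLE-TOKEN-2",  # hypothetical state token
}
train_url = classifier_train_url("food", classes)
print(train_url)
```

This check only mirrors the rules stated above. The Category component still reports its own errors, for example when it cannot extract any features from the training documents.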
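You can use a trained classifier to classify documents, by using the ClassifierQuery action.
The document can either be a document that exists in the index, or a document that does not exist in the index.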
In both cases, the IDOL Category component extracts the classifier feature fields from the query document, and compares the values in these feature fields against the trained classes in the classifier. The action returns the class that the document matches most closely.
To classify a document that exists in the index
Send the ClassifierQuery action to the IDOL Category component with the following parameters.
ClassifierName set to the name of the classifier to use to classify the document.
DocRef set to the IDOL reference of the document to classify.
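For an indexed document, the action might be assembled as in this Python sketch. The host, port, and `classify_indexed_url` helper are assumptions. The reference is taken from the soup document shown earlier in this topic.

```python
from urllib.parse import quote, urlencode

def classify_indexed_url(classifier, doc_ref,
                         base_url="http://localhost:9020"):
    """Build a ClassifierQuery URL for a document that is already indexed."""
    params = {
        "action": "ClassifierQuery",
        "ClassifierName": classifier,
        "DocRef": doc_ref,
    }
    # quote_via=quote encodes spaces as %20 and the slash as %2F.
    return f"{base_url}/{urlencode(params, quote_via=quote)}"

docref_url = classify_indexed_url("food", "Food/Carrot and Coriander Soup")
print(docref_url)
```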
To classify a document that does not exist in the index
Send the ClassifierQuery action to the IDOL Category component with the following parameters.
ClassifierName set to the name of the classifier to use to classify the document.
QueryText set to the percent-encoded IDX or XML document (Category detects the correct format automatically).
For example, you can classify the following document by percent-encoding it and sending it as the QueryText value:
#DREREFERENCE Food/Leek and Potato Pie
#DRETITLE Leek and Potato Pie
#DRESECTION 0
#DREFIELD Ingredient="leek"
#DREFIELD Ingredient="potato"
#DREFIELD Ingredient="cheese"
#DREFIELD Ingredient="shortcrust pastry"
#DREFIELD Ingredient="butter"
#DREFIELD Ingredient="egg"
#DREFIELD Herbs="rosemary"
#DREFIELD Herbs="thyme"
#DREFIELD Meal="pie"
#DREFIELD Equipment="pie dish"
#DREFIELD PreparationTime="10 minutes"
#DREFIELD CookingTime="1 hour"
#DREFIELD Description="This easy recipe makes a tasty leek and potato pie"
#DRECONTENT
Pie recipe
#DREENDDOC
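Percent-encoding an IDX document by hand is error-prone, so a small helper can do it. The Python sketch below uses a shortened version of the pie document. The host, port, and `classify_text_url` helper are hypothetical, and the sketch encodes every reserved character, which is stricter than the server may require.

```python
from urllib.parse import quote, unquote

def classify_text_url(classifier, document_text,
                      base_url="http://localhost:9020"):
    """Build a ClassifierQuery URL that carries the document in QueryText."""
    # safe="" percent-encodes every reserved character, including
    # newlines, quotes, '#', '=', and '/'.
    encoded = quote(document_text, safe="")
    return (f"{base_url}/action=ClassifierQuery"
            f"&ClassifierName={quote(classifier, safe='')}"
            f"&QueryText={encoded}")

# Shortened version of the pie document, one IDX line per text line.
idx_doc = "\n".join([
    "#DREREFERENCE Food/Leek and Potato Pie",
    "#DRETITLE Leek and Potato Pie",
    '#DREFIELD Ingredient="leek"',
    '#DREFIELD Ingredient="potato"',
    '#DREFIELD Herbs="thyme"',
    "#DRECONTENT",
    "Pie recipe",
    "#DREENDDOC",
])

text_url = classify_text_url("food", idx_doc)
# Decoding the QueryText value should give back the original document.
round_trip = unquote(text_url.split("QueryText=", 1)[1])
```

The round trip through `unquote` is a quick sanity check that the encoding step did not lose any part of the document.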