Train Passage Extractor Classifiers

The Answer Server Passage Extractor uses a question classifier to determine the type of a question, and therefore what entities (if any) to extract from candidate answers. The type refers to the kind of information that the question is requesting. For example, the question "How many points make up a perfect fivepin bowling score?" is looking for a number, while the question "What is an annotated bibliography?" is looking for a description.

The question classifier is always required. The Passage Extractor system does not return any answers without it.

To configure the question classifier, you can train one, or use a saved one from a previous training run.


You can save classifiers only if you set the ClassifierFile and LabelFile configuration parameters in your Passage Extractor system configuration. See Configure the Passage Extractor System.

The Answer Server installation ZIP package includes a default set of pre-trained classifier files for the Passage Extractor.

To use the pre-trained classifier, copy the DAT files from the installation ZIP package to a suitable location. Then update the ClassifierFile and LabelFile configuration parameters to point to the en_svm.dat and en_labels.dat files respectively. See Configure the Passage Extractor System.

The following sections provide more information about how to create and train your own classifiers.

Create a Training File

For the English language, the Answer Server installation includes a suitable training file. For other languages, you might need to create your own training file to describe the kind of question classifications that you expect to send to your Passage Extractor.

Each line of the training file defines a label and an example question, in the following format:

Label;Example Question

The example questions are the training data. The label specifies the kind of information that the question is requesting. For example, the first few lines of the training file might be:

DESC:desc;What did the only repealed amendment to the U.S. Constitution deal with?
NUM:count;How many points make up a perfect fivepin bowling score?
DESC:def;What is an annotated bibliography?
NUM:date;What is the date of Boxing Day?
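Before you train on a custom file, it can help to check that every line follows the Label;Example Question format. The following Python sketch is one way to do that. It is not part of Answer Server, the function name parse_training_lines is hypothetical, and it assumes that the first semicolon on a line separates the label from the question.

```python
from collections import Counter


def parse_training_lines(lines):
    """Split each 'Label;Example Question' line into a (label, question) pair.

    Assumption: the first semicolon separates the label from the question,
    so labels must not contain semicolons (questions may).
    """
    pairs = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        label, separator, question = line.partition(";")
        if not separator or not label.strip() or not question.strip():
            raise ValueError(f"line {number}: expected 'Label;Example Question'")
        pairs.append((label.strip(), question.strip()))
    return pairs


sample = [
    "DESC:desc;What did the only repealed amendment to the U.S. Constitution deal with?",
    "NUM:count;How many points make up a perfect fivepin bowling score?",
    "DESC:def;What is an annotated bibliography?",
    "NUM:date;What is the date of Boxing Day?",
]

pairs = parse_training_lines(sample)

# Count the examples per label to spot labels with too few training questions.
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

Counting the examples per label is a quick way to notice classifications that have few or no training questions.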

The default training file uses a Text Retrieval Conference (TREC) classification system to specify question classifications. HPE recommends that you use this classification system, which is based on a commonly used set. For more information, see Training File Labels. However, you can use your own classification system if required.

Train a Classifier

To train the question classifier, use the ManageResources action, which accepts a JSON object with the details of the training file. The JSON object takes the following form:

{
   "operation": "train",
   "type": "classifier",
   "trainingfile": "classifier_training.txt",
   "savemodel": "true"
}

If you do not want to save the training model (for example, during testing), set savemodel to false. You cannot set savemodel to true unless you have configured the ClassifierFile and LabelFile parameters for your Passage Extractor system.

The trainingfile parameter sets the location and name of a suitable training file. The training file contains a set of training questions, each with a label that specifies the sort of answer that the question is looking for (for example, a person, place, or description).
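If you generate the training request from a script, you can build the JSON object programmatically rather than by hand. The following Python sketch shows one way to do so. The helper name build_train_request is hypothetical, and the sketch covers only the JSON object itself, not how you pass it to the ManageResources action. It assumes that savemodel is sent as the string "true" or "false", matching the form shown above.

```python
import json


def build_train_request(training_file, save_model):
    """Return the training request as a dictionary.

    Assumption: savemodel is a string ("true" or "false"), as in the
    documented form of the JSON object.
    """
    return {
        "operation": "train",
        "type": "classifier",
        "trainingfile": training_file,
        "savemodel": "true" if save_model else "false",
    }


# During testing, you might not want to save the trained model.
request = build_train_request("classifier_training.txt", save_model=False)
payload = json.dumps(request, indent=3)
print(payload)
```

Set save_model to True only after you have configured the ClassifierFile and LabelFile parameters.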

You can use the GetResources action to retrieve the whole JSON schema for the operation, in the same way as for Answer Bank systems. See Find the JSON Schema for Your Update.

Classifier Behavior File

In addition to the main classifier and label files, there is a classifier behavior file, which is available in the Answer Server installation.

The classifier behavior file contains details of question classifications that Answer Server must treat differently. In particular, it includes information about whether to always or never consider other question classifications when a particular classification is identified as the primary classification.

For example, you generally want to consider other location classifications when a question matches the LOC:other classification. Similarly, for classifications that match descriptive questions, you can specify that other classifications are never included, because classifications that match entities are less relevant but might score higher in the results.

The primary classification is determined by a probability threshold, which is 0.85 by default.
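The following Python sketch illustrates one possible reading of how the threshold and the behavior rules interact. It is not the Answer Server implementation. The function name, the "always" and "never" rule values, and the example scores are hypothetical. It also assumes that all classifications are considered when no score reaches the threshold, and that only the primary classification is used when no rule exists for it.

```python
def classifications_to_consider(scores, behavior, threshold=0.85):
    """Return the classifications to use, with the highest score first.

    scores:   mapping of classification label to classifier probability.
    behavior: mapping of label to "always" or "never" (hypothetical rule
              values; the real behavior file format is not shown here).
    """
    if not scores:
        return []
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[0]
    if scores[top] < threshold:
        # Assumption: with no primary classification, consider all candidates.
        return ranked
    if behavior.get(top) == "always":
        # The primary classification always considers the other classifications.
        return ranked
    # "never", or (by assumption) no rule: use the primary classification only.
    return [top]


behavior = {"LOC:other": "always", "DESC:def": "never"}

location = classifications_to_consider(
    {"LOC:other": 0.90, "LOC:city": 0.06, "DESC:def": 0.04}, behavior
)
description = classifications_to_consider(
    {"DESC:def": 0.88, "HUM:ind": 0.12}, behavior
)
print(location)
print(description)
```

In this sketch, a question whose primary classification is LOC:other keeps the other candidate classifications, while a definition question keeps only DESC:def.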

If you move or rename the classifier behavior JSON file, update the ClassifierBehaviorFile configuration parameter to specify the new name and location.