Introduction to Categorization
You can use Categorizer or Autonomy Collaborative Classifier to create categories. This chapter describes how to use Categorizer. Refer to the Autonomy Collaborative Classifier User Guide for information on creating categories using the taxonomy module.
Categorizer automatically organizes text documents of any type into predefined categories. These sections describe how to adapt the categorization process to obtain the best possible performance.
For example, start with a data set of 1000 news stories and a list of categories such as Sports, Politics, Entertainment, Science, and Business. The categorization process has two principal stages: training and testing. Divide the data set into training data, which might consist of 800 of the stories, and test data, which contains the remaining 200.
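The split described above can be sketched in a few lines of Python. This is only an illustration and not part of Categorizer: the `split_dataset` helper, the fixed random seed, and the placeholder story identifiers are all assumptions made for the example.

```python
import random


def split_dataset(stories, train_size, seed=0):
    """Shuffle a copy of the stories, then split it into training and test data."""
    shuffled = list(stories)
    # A fixed seed keeps the split reproducible between runs (an assumption,
    # not a Categorizer requirement).
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Hypothetical identifiers standing in for the 1000 news stories.
stories = [f"story-{i:04d}" for i in range(1000)]

training_data, test_data = split_dataset(stories, train_size=800)
print(len(training_data), len(test_data))  # prints "800 200"
```

Shuffling before the split avoids putting, for example, all of the most recent stories into the test data.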
In the training stage, Categorizer learns from the training data to build the agents that it uses later to categorize the test data. First, a human expert sorts the training data into the categories by reading each news story and deciding which category it belongs to. More than one category might be appropriate for some stories. For example, you could place a story about patenting the human genome in both Science and Business. Other stories might not fit any category, in which case they are discarded. After the training data has been sorted manually, you train the agents by running the training sections of Categorizer.
After training, you are ready to categorize additional documents. You enter the test documents into Categorizer, which automatically places each one into the category or categories that its mathematical rules decide are most appropriate. Like the human expert who sorted the training data, Categorizer might place a particular document in more than one category, or in no category at all.
The human expert must also sort the 200 test documents, so that you can compare the expert's decisions with the results from Categorizer and determine how well it categorizes. If the performance is satisfactory, you can then submit future news stories to Categorizer for automatic categorization. If required, you can also add the original 200 test documents to the training set.
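One way to examine the performance is to compare the expert's categories with the automatically assigned ones, category by category. The following Python sketch computes precision and recall for each category. These two measures are a common choice for this kind of comparison, but they are an assumption here, not necessarily what Categorizer reports. The `per_category_scores` function and the four sample documents are hypothetical.

```python
def per_category_scores(expert, predicted, categories):
    """Return {category: (precision, recall)} for predicted labels versus expert labels.

    Both arguments map a document ID to a set of category names. An empty set
    means the document was placed in no category.
    """
    scores = {}
    for category in categories:
        tp = fp = fn = 0
        for doc_id, expert_labels in expert.items():
            in_expert = category in expert_labels
            in_predicted = category in predicted.get(doc_id, set())
            if in_expert and in_predicted:
                tp += 1  # both agree the document belongs here
            elif in_predicted:
                fp += 1  # assigned automatically, but not by the expert
            elif in_expert:
                fn += 1  # assigned by the expert, but missed automatically
        # None marks a score that is undefined because the denominator is zero.
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        scores[category] = (precision, recall)
    return scores


categories = ["Sports", "Politics", "Entertainment", "Science", "Business"]

# Hypothetical expert decisions and automatic results for four test documents.
expert = {
    "doc1": {"Science", "Business"},
    "doc2": {"Sports"},
    "doc3": set(),
    "doc4": {"Politics"},
}
predicted = {
    "doc1": {"Science"},
    "doc2": {"Sports"},
    "doc3": {"Business"},
    "doc4": {"Politics", "Business"},
}

scores = per_category_scores(expert, predicted, categories)
for category in categories:
    print(category, scores[category])
```

In this sample, Business scores 0.0 for both measures: the automatic results miss the one document that the expert placed there and add two that the expert did not. Entertainment has no documents on either side, so both of its scores are undefined.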