Introduction to Categorization

The IDOL Category component automatically organizes text documents of any type into predefined categories. These sections describe how to adapt the categorization process to obtain the best possible performance.

For example, start with a data set of 1,000 news stories and a list of categories such as Sports, Politics, Entertainment, Science, and Business. The categorization process has two principal stages: training and testing. Divide the data set into training data, which might consist of 800 of the stories, and test data, which contains the remaining 200.
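The 800/200 split described above can be sketched in a few lines of Python. This is only an illustration, not part of Category: the `split_dataset` function, the `train_fraction` parameter, and the placeholder story identifiers are all hypothetical names chosen for this example.

```python
import random


def split_dataset(documents, train_fraction=0.8, seed=0):
    """Shuffle a copy of the documents and split it into (training, test).

    Hypothetical helper for illustration; Category does not provide it.
    """
    shuffled = list(documents)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for the 1,000 news stories.
stories = [f"story-{i}" for i in range(1000)]
training, test = split_dataset(stories)
print(len(training), len(test))  # 800 200
```

Shuffling before splitting helps to avoid a test set that contains only the most recent stories or only one source, which could make the measured performance unrepresentative.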

In the training stage, Category uses the training data to build the agents that it uses later to categorize the test data. First, a human expert sorts the training data into the categories by reading each news story and deciding which category it belongs to. More than one category might be appropriate for some stories. For example, you could place a story about patenting the human genome in both Science and Business. Other stories might not have an appropriate category, so the expert discards them. After the training data has been sorted manually, you train the agents by running the training sections of the Categorizer.
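As a rough sketch of what the manual sorting produces, the following Python groups expert-labeled stories by category. It assumes a plain dictionary of labels; the `manual_labels` data, the story identifiers, and the `group_by_category` function are hypothetical and do not reflect the format that Category expects for training data.

```python
# Hypothetical expert decisions: each story maps to a set of categories.
manual_labels = {
    "story-001": {"Science", "Business"},  # fits two categories
    "story-002": {"Sports"},
    "story-003": set(),                    # no appropriate category
}


def group_by_category(labels):
    """Return a mapping of category -> list of story IDs.

    Stories with no category are left out, which mirrors discarding them.
    """
    grouped = {}
    for story_id, categories in labels.items():
        for category in sorted(categories):
            grouped.setdefault(category, []).append(story_id)
    return grouped


training_sets = group_by_category(manual_labels)
for category, story_ids in sorted(training_sets.items()):
    print(category, story_ids)
```

A story that belongs to two categories appears in both groups, and a story with no category appears in none.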

After training, you are ready to categorize additional documents. You enter the test documents into Category, which automatically places them into the category or categories that its mathematical rules decide are most appropriate. Like the expert who sorted the training data, Category might place a particular document in more than one category, or in no category at all.

The human expert must also sort the 200 test documents, so that you can compare the expert's decisions with the results from Category and determine how well it categorizes. If the performance is satisfactory, you can then submit any future news stories to Category to categorize automatically. If required, you can also add the original 200 test documents to the training set.
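One way to examine the performance on the test documents is to compare, for each category, the expert's decisions with the automatic results. The Python sketch below computes precision and recall, two commonly used measures; the source does not state which measures Category reports, so treat this as an assumption. The `precision_recall` function and the sample data are hypothetical.

```python
def precision_recall(expected, predicted, category):
    """Compare expert labels with automatic labels for one category.

    expected and predicted map a document ID to a set of categories.
    Returns (precision, recall); a ratio with a zero denominator is
    reported as 0.0 by convention in this sketch.
    """
    true_pos = false_pos = false_neg = 0
    for doc_id, expert_categories in expected.items():
        by_expert = category in expert_categories
        by_system = category in predicted.get(doc_id, set())
        if by_expert and by_system:
            true_pos += 1
        elif by_system:
            false_pos += 1
        elif by_expert:
            false_neg += 1
    found = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / found if found else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical results for four test documents.
expert = {"a": {"Science", "Business"}, "b": {"Sports"}, "c": set(), "d": {"Science"}}
automatic = {"a": {"Science"}, "b": {"Sports"}, "c": {"Science"}, "d": {"Science"}}

for name in ("Science", "Business", "Sports"):
    print(name, precision_recall(expert, automatic, name))
```

In this sample, Science has full recall but lower precision, because document `c` was placed in Science although the expert gave it no category. Business has zero recall, because the second category of document `a` was missed.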