Defining Categories

After the taxonomy has been created, the next step is to attach a definition to each category so that the taxonomy can be populated with documents. Each category definition consists of a mathematical rule against which every document can be evaluated for membership in that category. A very simple (and incomplete) rule for the category Animal could be defined as follows:

“If a document contains the word ‘paw’ or ‘hoof,’ include the document in the Animal category.”
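
The example rule can be sketched in Python as a simple predicate over the document text. This is an illustration only, not Verity topic syntax; the names ANIMAL_TERMS and in_animal_category are hypothetical, and the sketch assumes that "contains the word" means a whole-word, case-insensitive match.

```python
import re

# Hypothetical trigger words, taken from the example rule in the text.
ANIMAL_TERMS = {"paw", "hoof"}


def in_animal_category(text: str) -> bool:
    """Return True if the document contains the word 'paw' or 'hoof'."""
    # Split into lowercase alphabetic words so that "pawn" does not match "paw".
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not ANIMAL_TERMS.isdisjoint(words)
```

A rule this simple misses plurals such as "paws" and "hooves", which is one reason the text calls it incomplete.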

As with creating taxonomies, there are several techniques for building rules that define categories. They include the following:

Expert-defined rules


A domain expert can use Verity’s topics to define a rule for each category. For more information about topics, see “What is a Topic?” These rules are sometimes called business rules because the domain expert typically tailors them to a specific business or business function.

Importing from an existing hierarchy


You can use this technique to define categories by extracting the hierarchies implicit in existing URLs or file systems, or hierarchies defined in metadata, such as the Dewey decimal numbers in a library catalog. The technique is useful when documents’ membership in categories already corresponds to such a hierarchy. In this situation, the first two stages, building the taxonomy and defining categories, are combined into one stage.
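
As a rough sketch of this idea, the following Python derives a category path from the directory portion of a URL or file path and groups documents under it. The function names category_path and populate are hypothetical, the sketch assumes forward-slash separators, and it does not represent how Verity performs the import.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlparse


def category_path(location: str) -> Tuple[str, ...]:
    """Derive a category path from the directories of a URL or file path."""
    # For a URL this drops the scheme and host; a plain path is returned as-is.
    path = urlparse(location).path
    parts = [p for p in path.split("/") if p]
    # The last component is the document itself; the directories are the categories.
    return tuple(parts[:-1])


def populate(locations: Iterable[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Group documents by the category path implied by their location."""
    taxonomy = defaultdict(list)
    for location in locations:
        taxonomy[category_path(location)].append(location)
    return dict(taxonomy)
```

Because the hierarchy supplies both the category names and the membership rule, building the taxonomy and defining the categories happen in the same pass.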

Using industry “standard” categories


A standards body or independent vendor can create category definition rules for an industry’s vertical taxonomies. This technique is closely associated with using industry-specific taxonomies. When an industry-standard taxonomy is used together with its standard category definitions, the stages of building the taxonomy and defining categories are again combined into one.

Automatic category creation


An automatic classification system that creates categories is fed positive and negative example documents, called training documents, that respectively denote membership in or exclusion from each category. The system “learns” from these training documents and creates a defining rule for each category. Verity’s Logistic Regression Classifier (LRC), which is based on machine learning technology, implements automatic category creation.
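
To make the idea of learning a rule from training documents concrete, here is a minimal bag-of-words logistic regression in plain Python. It is a generic textbook sketch, not Verity’s LRC; the function names, the learning rate, the epoch count, and the tiny training set are all assumptions chosen for illustration.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(positive, negative, epochs=200, rate=0.5):
    """Fit one weight per word by stochastic gradient descent on the log loss."""
    examples = [(tokenize(d), 1.0) for d in positive]
    examples += [(tokenize(d), 0.0) for d in negative]
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for words, label in examples:
            z = bias + sum(weights[w] for w in words)
            error = label - sigmoid(z)  # gradient of the log loss w.r.t. z
            bias += rate * error
            for w in words:
                weights[w] += rate * error
    return dict(weights), bias


def score(model, text):
    """Estimated probability that the document belongs to the category."""
    weights, bias = model
    z = bias + sum(weights.get(w, 0.0) for w in tokenize(text))
    return sigmoid(z)


# Illustrative training documents for the Animal category (made up for this sketch).
positive = ["the dog hurt its paw", "a horse has a hoof on each leg", "the cat licked a paw"]
negative = ["the stock market fell today", "a new phone has a large screen", "the bank raised rates today"]
model = train(positive, negative)
```

The learned weights and bias play the role of the category’s defining rule: a document is placed in the category when its score exceeds a chosen threshold, for example 0.5.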

Automatic creation using thematic mapping


When a taxonomy or subtaxonomy is created using thematic mapping, each category in the taxonomy or subtaxonomy is associated with a list of concept terms (nouns and noun phrases). These concept terms can be used to automatically create a topic for the category.
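
One way to picture how a list of concept terms could become a category rule is sketched below. It is not Verity’s topic format; make_rule and the min_hits threshold are hypothetical, and the sketch assumes concept terms are plain words or phrases matched on whole-word boundaries.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to alphabetic words joined by single spaces."""
    return " ".join(re.findall(r"[a-z]+", text.lower()))


def make_rule(concept_terms, min_hits=1):
    """Build a rule that accepts a document containing at least min_hits concept terms."""
    terms = [normalize(t) for t in concept_terms]

    def rule(text):
        # Pad with spaces so that phrases match only on whole-word boundaries.
        padded = " " + normalize(text) + " "
        hits = sum(1 for t in terms if " " + t + " " in padded)
        return hits >= min_hits

    return rule
```

For example, make_rule(["hoof", "guide dog", "paw"], min_hits=2) yields a rule that accepts a document only when at least two of the three concept terms appear in it.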

With the Verity classification infrastructure, you can implement any combination of these methods. Best results are often achieved by combining automatic capabilities with domain expertise.