About Taxonomies

A taxonomy is a hierarchical organization of categories. It is typically represented as a tree structure or directed graph, with a single, most general category at the top, branching downward to subcategories, which in turn branch into deeper subcategories. For example, Figure 3-5 shows a small portion of a taxonomy that classifies automobiles by manufacturer, product line, and model name.

Figure 3-5    A taxonomy of automobile models


Note   This taxonomy is not the only way to classify automobile brand names. You could devise completely different taxonomies that might instead classify automobiles by manufacturing location, consumer market segment, price, or other features. K2 allows the use of multiple taxonomies to categorize the same information; for more details, see About Relational Taxonomies.


Using taxonomies is a flexible approach to classification that can combine domain expertise with automatic techniques to apply a hierarchical structure to information from many different kinds of documents. Creating a taxonomy organizes your information assets into categories that are easy for users to understand and browse. There are four basic steps to implementing taxonomies in K2:

1. Build the taxonomy.

2. Create definitions for each category in the taxonomy.

3. Populate the taxonomy with documents.

4. Use the taxonomy.

Building the Taxonomy

A taxonomy is a skeletal navigation structure of named categories. It defines the views by which you want to organize your content. For example, a zoological taxonomy might have a category Animal, with child categories Bird and Cat. These children may in turn have descendants, and so on.

The result of this stage is typically a tree structure of names. (It is possible, however, for a category to have more than one parent category, in which case the structure is more generally characterized as a directed graph.)
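The tree-versus-graph distinction can be illustrated with a small data structure. This is a minimal sketch, not K2's internal representation: the Taxonomy class and its method names are hypothetical, and the Pet category is added only to show a category with two parents.

```python
class Taxonomy:
    """Hypothetical helper: a taxonomy stored as a directed graph of names."""

    def __init__(self):
        self.children = {}  # category name -> set of child names
        self.parents = {}   # category name -> set of parent names

    def add_category(self, name, parent=None):
        # Register the category itself (a root if no parent is given).
        self.children.setdefault(name, set())
        self.parents.setdefault(name, set())
        if parent is not None:
            self.add_category(parent)  # make sure the parent exists
            self.children[parent].add(name)
            self.parents[name].add(parent)

    def descendants(self, name):
        """Return every category reachable below `name`."""
        seen = set()
        stack = [name]
        while stack:
            for child in self.children[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    def is_tree(self):
        """True when no category has more than one parent."""
        return all(len(p) <= 1 for p in self.parents.values())


zoo = Taxonomy()
zoo.add_category("Animal")
zoo.add_category("Bird", parent="Animal")
zoo.add_category("Cat", parent="Animal")
zoo.add_category("Pet", parent="Animal")
tree_before = zoo.is_tree()           # True: every category has one parent

zoo.add_category("Cat", parent="Pet")  # Cat now has two parents
tree_after = zoo.is_tree()            # False: the structure is a directed graph
```

Storing both the child and parent links keeps downward browsing and the single-parent check equally cheap.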

There are several techniques you can use to build a taxonomy:

Use a domain expert. This is a common method for building a taxonomy. A domain expert builds the skeleton and assigns names to the categories. Yahoo! and some other corporations employ this method.


Import a taxonomy. This technique allows users to extract implicit hierarchies from existing URL or file system hierarchies, or metadata, and mirror them in a taxonomy.


Purchase an industry taxonomy. Vendors such as Lexis-Nexis or Factiva provide taxonomies for particular industry segments.


Use thematic mapping. In many enterprises, the information explosion has reached a point where it is impossible for a human to envision all the various themes and topics represented in the corpus—the entire collection or body of knowledge—of the enterprise. K2 provides a thematic-mapping capability that extracts key concepts from a set of documents, constructs a taxonomy from them, and assigns the documents to the taxonomy. Thematic mapping also generates human-readable labels for the concepts (categories) in the taxonomy.


For more information on thematic mapping, see the Verity Knowledge Console Guide.
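Of the techniques above, importing is the simplest to picture. The following is a minimal sketch of mirroring a folder hierarchy in a taxonomy; it is not a K2 tool, and the function name and sample paths are hypothetical. Each folder becomes a category identified by its full path, so folders with the same name in different branches stay distinct.

```python
from pathlib import PurePosixPath


def taxonomy_from_paths(paths):
    """Map each folder (category) to the set of its immediate subfolders."""
    children = {}
    for path in paths:
        folder = PurePosixPath(path).parent
        # Walk from the document's folder up to the top-level folder.
        while folder.name:
            children.setdefault(str(folder), set())
            parent = folder.parent
            if parent.name:
                children.setdefault(str(parent), set()).add(str(folder))
            folder = parent
    return children


# Hypothetical document paths, used only for illustration.
paths = [
    "autos/ford/mustang/review.html",
    "autos/ford/focus/specs.html",
    "autos/honda/civic/review.html",
]
taxonomy = taxonomy_from_paths(paths)
# taxonomy["autos"] == {"autos/ford", "autos/honda"}
```

The same walk works for URL paths once the scheme and host are stripped.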

Creating Category Definitions

The next step is to attach a definition to each category to control how it is populated with documents. A category definition consists of a mathematical rule against which each document can be evaluated for membership in that category. For example, a very simple rule for the category Animal might be “If a document contains the word paw or hoof, it belongs to this category.” The result of this stage is a taxonomy with attached category definitions.
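The paw-or-hoof rule can be written as a small predicate. This is a sketch in plain Python, not Verity topic syntax; the contains_any helper is hypothetical. It matches whole words only, so a plural such as "paws" would need to be listed separately.

```python
import re


def contains_any(*words):
    """Build a rule that accepts a document containing any of `words`."""
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

    def rule(text):
        return pattern.search(text) is not None

    return rule


animal_rule = contains_any("paw", "hoof")

in_category = animal_rule("The horse injured its hoof.")    # True
not_in_category = animal_rule("Quarterly sales rose.")      # False
partial_word = animal_rule("Pawn shops reported growth.")   # False: "Pawn" is not "paw"
```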


Note   For certain types of taxonomies (rule-based taxonomies), category definitions are part of the taxonomy itself and do not need to be attached in a separate step.


There are several methods you can use to create category definitions:

Business rules. A domain expert defines a rule for each category. Verity’s powerful topics feature (see Creating Topic Sets) is one means for generating such business rules.


Import. Document membership in categories mirrors a file system or URL hierarchy. This method corresponds to the import technique for building the taxonomy. Both the taxonomy structure and the membership of documents in categories mirror a specified structure such as a file system.


Industry taxonomy. A standards body, an independent vendor, or Verity Professional Services creates category definition rules for an industry taxonomy. This method corresponds to the industry taxonomy technique for building the taxonomy. Both the taxonomy structure and the membership of documents in categories are specified by the industry taxonomy.


Automatic classification. Verity provides an automatic classification system, called the Logistic Regression Classifier (LRC), which uses machine learning to perform this task. The LRC is fed positive and negative example documents for membership in each category. The system learns from these “training documents” and creates a defining rule for each category.


For more information on the LRC, see the Verity Knowledge Console Guide.
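To show the general idea of learning a category rule from training documents, here is a minimal sketch of textbook logistic regression over word counts, fit by stochastic gradient descent. It is not Verity's LRC implementation; the function names, the training texts, and the epochs and rate values are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return Counter(text.lower().split())


def probability(weights, bias, counts):
    """Estimated probability that a document belongs to the category."""
    z = bias + sum(weights[word] * n for word, n in counts.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, rate=0.5):
    """examples: list of (text, label), label 1 = positive, 0 = negative."""
    weights = Counter()  # missing words read as weight 0
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            counts = tokenize(text)
            error = label - probability(weights, bias, counts)
            bias += rate * error
            for word, n in counts.items():
                weights[word] += rate * error * n
    return weights, bias


# Hypothetical training documents for the category Animal.
training = [
    ("the cat licked its paw", 1),
    ("a horse hoof needs trimming", 1),
    ("quarterly sales report", 0),
    ("engine oil change schedule", 0),
]
weights, bias = train(training)

animal_score = probability(weights, bias, tokenize("the dog hurt its paw"))
other_score = probability(weights, bias, tokenize("sales report on engine oil"))
```

The learned weights and bias play the role of the defining rule: a document is assigned to the category when its score exceeds a chosen threshold such as 0.5.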

Populating the Taxonomy

Once the taxonomy is built and each of its categories is defined, you can populate it with documents. The result of this stage is a fully functional taxonomy, which might be a topic set or portion of a parametric index.

You can accomplish this task either manually or automatically:

Manual. Experts determine which documents belong in which categories, and they populate the category nodes in the taxonomy accordingly. Yahoo! populates its Web taxonomy in this way.


Automatic. A Verity tool evaluates each document against the rule for each category and assigns the document to the appropriate categories in the taxonomy. This approach takes advantage of the business-rules category definitions described earlier.


The best approach is often to use a combination of the automatic and manual methods.
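A combined approach might look like the following sketch: every document is evaluated against every category rule, and an expert's manual assignments are then merged in. This is not a Verity tool; the populate and has_word helpers, the document IDs, and the rules are hypothetical.

```python
def has_word(*words):
    """Build a rule that accepts text containing any of `words`."""
    wanted = set(words)
    return lambda text: bool(wanted & set(text.lower().split()))


def populate(documents, rules, manual=None):
    """documents: {doc_id: text}; rules: {category: rule};
    manual: optional {category: [doc_id, ...]} chosen by an expert."""
    assignments = {category: [] for category in rules}
    # Automatic step: test each document against each category rule.
    for doc_id, text in documents.items():
        for category, rule in rules.items():
            if rule(text):
                assignments[category].append(doc_id)
    # Manual step: add expert assignments the rules did not produce.
    for category, doc_ids in (manual or {}).items():
        bucket = assignments.setdefault(category, [])
        for doc_id in doc_ids:
            if doc_id not in bucket:
                bucket.append(doc_id)
    return assignments


documents = {
    "d1": "The horse lost a shoe from its hoof",
    "d2": "The parrot spread a wing",
    "d3": "Annual budget summary",
}
rules = {
    "Animal": has_word("paw", "hoof"),
    "Bird": has_word("wing", "beak"),
}

automatic = populate(documents, rules)
# {"Animal": ["d1"], "Bird": ["d2"]}

combined = populate(documents, rules, manual={"Animal": ["d2"]})
# {"Animal": ["d1", "d2"], "Bird": ["d2"]}
```

Here the Animal rule misses the parrot document, and the manual pass corrects that without changing the rule.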

Using the Taxonomy

Studies show that with a well-implemented taxonomy, users rely on a balanced combination of search and browsing to locate documents. A search can be issued at the top level, which filters the taxonomy down to the categories that contain matching documents, or it can be scoped to a subset of the taxonomy.
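The two search modes can be sketched as follows. This is an illustration only, not the K2 search API: the search and subtree functions, the sample data, and the substring matching are assumptions.

```python
def subtree(children, root):
    """Return `root` plus every category below it."""
    found = {root}
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(query, documents, assignments, children, scope=None):
    """Return {category: [doc_id, ...]} for categories with matching documents.
    With `scope`, only that category and its descendants are searched."""
    categories = subtree(children, scope) if scope else set(assignments)
    hits = {}
    for category in sorted(categories):
        matched = [
            doc_id
            for doc_id in assignments.get(category, [])
            if query.lower() in documents[doc_id].lower()
        ]
        if matched:
            hits[category] = matched
    return hits


# Hypothetical taxonomy, documents, and category membership.
children = {"Animal": ["Bird", "Cat"], "Bird": [], "Cat": []}
documents = {
    "d1": "Feeding guide for parrots",
    "d2": "Feeding guide for kittens",
    "d3": "Parrot wing anatomy",
}
assignments = {"Animal": ["d1", "d2", "d3"], "Bird": ["d1", "d3"], "Cat": ["d2"]}

top_level = search("feeding", documents, assignments, children)
# {"Animal": ["d1", "d2"], "Bird": ["d1"], "Cat": ["d2"]}

scoped = search("feeding", documents, assignments, children, scope="Bird")
# {"Bird": ["d1"]}
```

The top-level result doubles as a browsing aid, because it shows which categories are worth opening for the query.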

To accommodate the different ways that various groups use information, you can create multiple taxonomies to organize content in ways that make the most sense for each group. For example, separate taxonomies can be created for the sales, marketing, human resources, and engineering departments. This puts information into the context of your overall business model, and adds a valuable dimension to the content discovery process.