Information-Extraction Services

The core capabilities of Verity K2 focus on accessing documents in heterogeneous repositories to extract information and build data structures needed by information-access applications.

Administrators and knowledge workers (or indexing applications that use the Verity APIs) can use these information-extraction services to create collections and other types of indexes that support search and information access.

Figure 1-2 shows some of the available information-extraction services. More information on each is available elsewhere in this book.

Indexing. Before users can search or classify enterprise information, it generally must be indexed. The K2 system uses gateways and the Verity engine to gather information into a universal index called a collection.

The main purpose of a collection is to support sophisticated, multi-featured text search. A collection stores the locations of all indexed documents and a list of essentially all words contained within the text of those documents. (A collection does not contain the actual documents themselves.)
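
The idea of a collection can be illustrated with a minimal sketch. This is not the Verity engine or its API; the function names (`build_collection`, `search`) and the sample locations are assumptions chosen for illustration. It shows only the principle described above: the index keeps document locations and words, not the documents themselves.

```python
import re
from collections import defaultdict

def build_collection(documents):
    """Map each word to the set of document locations that contain it.

    `documents` maps a location (path or URL) to the document text.
    Only locations and words are kept, not the documents themselves.
    """
    words = defaultdict(set)
    for location, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            words[word].add(location)
    return {"locations": sorted(documents), "words": dict(words)}

def search(collection, *terms):
    """Return the locations of documents that contain every term."""
    postings = [collection["words"].get(t.lower(), set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical sample documents.
docs = {
    "file:///reports/q1.txt": "Quarterly sales report for the sedan line",
    "file:///reports/q2.txt": "Quarterly service report",
}
collection = build_collection(docs)
print(search(collection, "quarterly", "sales"))
```

A real collection supports far richer queries; the sketch covers only a plain AND search over words.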

Classification. Beyond search, information access can also involve classification, in which documents are organized into one or more taxonomies. A taxonomy is a browsable, searchable hierarchy of categories. Verity K2 includes several information-extraction services related to creating and populating taxonomies.

Parametric indexes. Parametric indexes are structures built on top of collections or other sets of documents. These indexes support two capabilities. Parametric search is the ability to search heterogeneous collections of documents that contain both structured metadata (parameters) and free text. Taxonomy browse is the ability to navigate through a taxonomy of category links to arrive at the desired documents.

For example, documents that describe cars might have structured attributes such as Color, Price, Make, Model, Location, and Year, plus free-text descriptions of the cars. Also, a taxonomy applied to the descriptions might organize the documents by manufacturer, make, model, and year.

The parametric-selection portion of a parametric search would involve the user selecting the desired attributes from the structured data. The free-text portion would involve searching for terms in the descriptions. Taxonomy browse would involve navigating through the hierarchy of links to find the desired model and year. (See, for example, Figure 4-1.)
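
The three operations in the car example can be sketched as follows. This is an illustrative model only, not the K2 parametric index or its API. The record layout, the sample data, the function names `parametric_search` and `browse`, and the taxonomy levels Make > Model > Year are all assumptions.

```python
# Hypothetical car documents: structured attributes plus a free-text description.
cars = [
    {"id": 1, "Make": "Honda", "Model": "Civic", "Year": 2003, "Color": "Red",
     "description": "One owner, low mileage, sunroof"},
    {"id": 2, "Make": "Honda", "Model": "Accord", "Year": 2003, "Color": "Blue",
     "description": "Leather seats and sunroof"},
    {"id": 3, "Make": "Ford", "Model": "Focus", "Year": 2002, "Color": "Red",
     "description": "New tires, low mileage"},
]

def parametric_search(docs, params=None, text=None):
    """Combine attribute selection with a free-text match on the description."""
    params = params or {}
    hits = []
    for doc in docs:
        if any(doc.get(name) != value for name, value in params.items()):
            continue  # fails the parametric-selection portion
        if text and text.lower() not in doc["description"].lower():
            continue  # fails the free-text portion
        hits.append(doc["id"])
    return hits

def browse(docs, path, levels=("Make", "Model", "Year")):
    """Taxonomy browse: follow category links one level at a time."""
    return [d["id"] for d in docs
            if all(d[level] == value for level, value in zip(levels, path))]

print(parametric_search(cars, {"Color": "Red"}, "low mileage"))
print(browse(cars, ("Honda", "Civic", 2003)))
```

Passing a shorter path to `browse`, such as `("Honda",)`, corresponds to stopping at a higher level of the taxonomy.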

Relational taxonomies. Verity parametric indexes can further include a high-level classification concept called relational taxonomies. With relational taxonomies, more than one taxonomy is applied to a parametric index. Users can simultaneously navigate the multiple taxonomies, drilling down and jumping from one to the other, navigating to the information they seek in the manner most intuitive to them. (See, for example, Figure 4-3.)
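
As a rough sketch of simultaneous navigation, suppose each document is assigned a category path in each of two taxonomies. The taxonomy names (`Region`, `Topic`), the sample assignments, and the helper functions below are hypothetical and are not part of K2. The point is that a selection in one taxonomy narrows the categories still available in the other.

```python
# Hypothetical assignments: one category path per taxonomy for each document.
assignments = {
    "doc1": {"Region": ("Europe", "France"), "Topic": ("Finance", "Tax")},
    "doc2": {"Region": ("Europe", "Germany"), "Topic": ("Finance", "Audit")},
    "doc3": {"Region": ("Asia", "Japan"), "Topic": ("Finance", "Tax")},
}

def drill_down(assignments, selections):
    """Return documents whose path in every selected taxonomy starts with the selection."""
    return sorted(
        doc for doc, paths in assignments.items()
        if all(paths.get(tax, ())[:len(sel)] == tuple(sel)
               for tax, sel in selections.items())
    )

def remaining_categories(assignments, selections, taxonomy, depth):
    """List the categories at `depth` in `taxonomy` that still have matching documents."""
    docs = drill_down(assignments, selections)
    return sorted({assignments[d][taxonomy][depth] for d in docs
                   if len(assignments[d][taxonomy]) > depth})

# Select Finance > Tax in one taxonomy, then see which regions remain in the other.
print(remaining_categories(assignments, {"Topic": ("Finance", "Tax")}, "Region", 0))
print(drill_down(assignments, {"Region": ("Europe",), "Topic": ("Finance", "Tax")}))
```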

Thematic mapping. Thematic mapping is a process that automatically extracts key concepts from a set of documents, constructs a taxonomy from them, and assigns the documents to the taxonomy to create a parametric index. (See Building the Taxonomy.)

Profiling. Using the K2 Profiler, an application can automatically classify incoming documents, assigning them to one or more categories based on criteria such as subject areas of interest to specific users.

The categories used for profiling are implemented in structures called profile nets, which are stored queries that knowledge workers can create manually or with the help of command-line tools. (See Creating Profile Nets.)
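
The notion of categories as stored queries can be sketched in a few lines. This is a simplified model, not the K2 Profiler or the profile net format: the category names, the query predicates, and the `profile` function are assumptions made for illustration.

```python
import re

# Hypothetical profile net: each category is a stored query over a document's words.
profile_net = {
    "Mergers": lambda words: "merger" in words or "acquisition" in words,
    "Earnings": lambda words: "earnings" in words and "quarter" in words,
}

def profile(text, net):
    """Return every category whose stored query matches the incoming document."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(name for name, query in net.items() if query(words))

print(profile("Quarter earnings rose after the acquisition", profile_net))
print(profile("Weather update for the weekend", profile_net))
```

Because the queries are stored ahead of time, each incoming document is tested against all of them once, which is the reverse of ordinary search, where one query is tested against many stored documents.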

Feature Extraction. When displaying search results, a K2 application can provide a summary of each document and can cluster related documents together on the page.

Clustering, thematic mapping, and some kinds of summarization rely on an underlying process called feature extraction, in which the most important key words and concepts in a document are automatically extracted and saved during collection indexing.

See Extracting Document Features for more details.
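
As a deliberately simple stand-in for feature extraction, the sketch below ranks the words of a document by frequency after dropping common stopwords. It does not reproduce the algorithm K2 uses, which this section does not describe; the stopword list and the `extract_features` function are assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from"}

def extract_features(text, top_n=3):
    """Return the `top_n` most frequent non-stopword terms in the text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) > 2 and w not in STOPWORDS]
    counts = Counter(words)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]

text = ("The engine indexes documents. The engine clusters documents by topic, "
        "and the engine summarizes documents.")
print(extract_features(text, top_n=2))
```

Features saved this way at indexing time could later be compared across documents, which is the kind of reuse that clustering and thematic mapping depend on.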

Entity Extraction. K2 includes a component called the Verity Extractor, an engine that applications can use to extract entities from a document or set of documents. Entities are words or blocks of text that have specific meaning, such as names, telephone numbers, URLs, addresses, and product IDs.

Applications can use the results of extraction to populate collection fields or taxonomy categories, to forward the entity information to other analytical programs, or to validate or route documents based on entity rules. See Extracting Entities.
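
A pattern-based sketch can illustrate both extraction and routing on entity rules. It does not use the Verity Extractor or its API: the regular expressions, the `PN-` product ID format, the queue names, and the functions `extract_entities` and `route` are all assumptions.

```python
import re

# Hypothetical entity patterns; real extractors use far richer rules.
ENTITY_PATTERNS = {
    "url": re.compile(r"https?://[^\s,;]+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "product_id": re.compile(r"\bPN-\d{4}\b"),
}

def extract_entities(text):
    """Return every match for each entity type found in the text."""
    return {kind: pattern.findall(text) for kind, pattern in ENTITY_PATTERNS.items()}

def route(text):
    """Entity rule: documents that mention a product ID go to the parts queue."""
    return "parts-queue" if extract_entities(text)["product_id"] else "general-queue"

sample = "Call 408-555-0100 about part PN-1234, or see http://example.com/parts for details."
print(extract_entities(sample))
print(route(sample))
```

The extracted values could equally be written into collection fields or used to assign the document to taxonomy categories.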

Entity Profiling (for Recommendation). The Verity Recommendation Engine is a K2 component that brings sophisticated, high-level socialization and personalization capabilities to applications. Recommendation applications can connect users to social networks, not only providing results for search queries but also adaptively ranking results, locating experts, recommending alternative documents, and connecting the user with communities of people with similar interests.

The K2 Recommendation Engine stores the information it needs in entity profiles (recommendation indexes) that record the activities and preferences of users and the actions taken on various documents and other entities. The information changes over time as the system evolves. See Providing Recommendations.
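
The role of entity profiles can be illustrated with a small co-occurrence sketch. It is not the Recommendation Engine or its data model: the `EntityProfiles` class, its methods, and the overlap-based scoring are assumptions chosen to show how recorded user activity can drive recommendations that change as more activity is recorded.

```python
from collections import Counter, defaultdict

class EntityProfiles:
    """Hypothetical profile store: records which documents each user acted on."""

    def __init__(self):
        self.user_docs = defaultdict(set)

    def record(self, user, doc):
        """Record that `user` viewed or otherwise acted on `doc`."""
        self.user_docs[user].add(doc)

    def recommend(self, user, limit=3):
        """Suggest unseen documents favored by users with overlapping activity."""
        seen = self.user_docs.get(user, set())
        scores = Counter()
        for other, docs in self.user_docs.items():
            overlap = len(seen & docs)
            if other == user or overlap == 0:
                continue
            for doc in docs - seen:
                scores[doc] += overlap  # weight by similarity to the other user
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        return [doc for doc, _ in ranked[:limit]]

profiles = EntityProfiles()
for user, doc in [("ann", "a"), ("ann", "b"), ("bob", "a"), ("bob", "b"),
                  ("bob", "c"), ("cat", "b"), ("cat", "d"), ("dan", "e")]:
    profiles.record(user, doc)
print(profiles.recommend("ann"))
```

The same overlap scores could be used to locate users with similar interests, which is one way to approximate the expert-location and community features described above.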