Extracting Entities


K2 developers can build applications on Verity Extractor, an engine that extracts entities from a document or set of documents. An entity is a word or block of text that has a specific meaning, such as a name, telephone number, URL, address, or product ID.

With Verity Extractor, your application can:

Identify and extract elements from document content, based on predefined grammars and rules.

 

Use extracted elements to support analytical applications or metadata-enriched search and browse applications.

 

Validate documents based on defined patterns or rules.

 

The Verity Extractor product includes these components:

Core engine. Supports the extraction of predefined entities using prebuilt resource files (dictionaries and grammars).

 

Dictionaries. A dictionary is an XML file that provides a vocabulary for a simple entity, such as a city or country name. A dictionary has a list of headwords; each headword has a set of associated words called synonyms. Verity Extractor uses the dictionary to scan a document and extract the defined entities that match the headwords or synonyms.

Grammars. A grammar is an XML file that provides rules for complex entities such as URLs or postal addresses. Rules are written in regular-expression format, can be recursive, and can refer to other grammars and dictionaries. Verity Extractor uses the grammar to scan a document and extract entities that match the rules.

C API. Allows you to write a C application that performs entity extractions.

 

Java API. Allows you to write a Java application that performs entity extractions.

 

Command-line tool (mkve). Allows you to perform entity extractions from the command line and generate output in multiple formats.

 

Entity extraction filter (flt_ve). A document filter that extracts entities during the K2/VDK indexing process, allowing entities to be extracted and immediately stored in Verity collection fields.

 

Common entity resource files. A basic set of pre-packaged, U.S.-specific dictionary and grammar files for common entities such as person, place, organization, address, phone number, email address, date and time.
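The dictionary component in the list above can be illustrated with a minimal Python sketch. It is not the Verity Extractor API and does not use its XML dictionary format: the `CITY_DICTIONARY` data, the `extract_with_dictionary` function, and the shape of the returned tuples are all assumptions made for illustration. The sketch shows only the idea that each headword has a set of synonyms and that a match on either is reported under the headword.

```python
import re

# Hypothetical dictionary: headword -> synonyms (not Verity's XML format).
CITY_DICTIONARY = {
    "New York": ["NYC", "New York City"],
    "Los Angeles": ["LA", "L.A."],
    "Tokyo": [],
}


def extract_with_dictionary(text, dictionary, entity_type):
    """Return (entity_type, headword, matched_text, offset) for each match."""
    # Map every surface form (headword or synonym) back to its headword.
    surface_to_headword = {}
    for headword, synonyms in dictionary.items():
        for form in [headword, *synonyms]:
            surface_to_headword[form] = headword

    if not surface_to_headword:
        return []

    # Try longer forms first so "New York City" wins over "New York".
    forms = sorted(surface_to_headword, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(f) for f in forms) + r")(?!\w)"
    )

    return [
        (entity_type, surface_to_headword[m.group(0)], m.group(0), m.start())
        for m in pattern.finditer(text)
    ]
```

Calling `extract_with_dictionary("Flights from NYC to Tokyo", CITY_DICTIONARY, "city")` reports the synonym `NYC` under the headword `New York`, which is the normalization a headword and synonym list provides.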

 

Figure 3-4 shows the Verity Extractor being used to extract entities into a Verity collection during the indexing process.


Figure 3-4    Verity Extractor (shaded) used during indexing
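The flow in Figure 3-4 can be sketched conceptually in Python. This is not the `flt_ve` filter or the K2/VDK indexing API: the `index_document` function, the `EXTRACTORS` table, and the dict used as a stand-in for a collection are hypothetical. The sketch shows only the general shape of the process, in which entities are extracted while a document is indexed and stored immediately in named fields.

```python
import re

# Hypothetical extractors: field name -> function returning matched strings.
EXTRACTORS = {
    "email": lambda text: re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text),
    "url": lambda text: re.findall(r"https?://\S+", text),
}


def index_document(doc_id, text, collection, extractors=EXTRACTORS):
    """Store the document body plus one field per entity type."""
    record = {"body": text}
    for field, extract in extractors.items():
        # De-duplicate while keeping first-seen order.
        record[field] = list(dict.fromkeys(extract(text)))
    collection[doc_id] = record
    return record
```

Because the entities land in fields at indexing time, a later search can filter on a field such as `email` without rescanning the document text.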



Table 3-1 lists examples of some of the entities that can be extracted with the common entity resource files.

 

Table 3-1    Some common entities


Entity                   Description/Components      Example(s)
-----------------------  --------------------------  ---------------------------------
Commercial organization  Business name               Verity, Inc.
                         Business type               Cardiff Software
                         Business designator

Person(s)                Title                       President George W. Bush
                         Given name                  Harold Potter
                         Family name                 Mr. and Mrs. John Doe
                         Suffix                      Mary and John Doe

Place(s)                 City                        Tokyo
                         State                       Sunnyvale, California
                         Country                     United Kingdom
                         Region                      Northern and Eastern Europe

Postal address                                       894 Ross Drive, Sunnyvale
                                                     PO Box 9090, Beverly Hills, 90210

Internet address                                     http://www.verity.com

Money                                                $200.00
                                                     $50 million

Date                     Any time designated by a    Monday
                         particular day, month, or   last August
                         year                        1/1/2004
                                                     300 B.C.
                                                     July 4th
                                                     Dec. 2000
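Entities such as Money and Internet address in Table 3-1 follow regular patterns, which is why rules in regular-expression format suit them. The following Python sketch illustrates that idea under stated assumptions: the rule names, the regular expressions, and the `extract_with_rules` function are invented for illustration, are far simpler than the packaged grammars, and do not reflect the Verity grammar file syntax.

```python
import re

# Illustrative rules only; real grammars cover many more variants.
RULES = {
    "money": re.compile(
        r"\$\d+(?:,\d{3})*(?:\.\d{2})?(?:\s(?:thousand|million|billion))?"
    ),
    "internet_address": re.compile(r"https?://[\w.-]+(?:/[\w./-]*)?"),
}


def extract_with_rules(text, rules=RULES):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = []
    for entity_type, pattern in rules.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), entity_type, m.group(0)))
    hits.sort()  # order by offset in the text
    return [(entity_type, matched) for _, entity_type, matched in hits]
```

Both Money examples from the table, `$200.00` and `$50 million`, match the single `money` rule because the decimal part and the magnitude word are each optional.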


With Verity Extractor, you can also create custom dictionaries and grammars, either to refine the extraction of the common entities or to extract entities beyond them.
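As a rough illustration of a custom entity, the Python sketch below combines a small word list with a pattern rule, mirroring the way a grammar rule can refer to a dictionary. Everything in it is hypothetical: the `PRODUCT_LINES` list, the `AB-1234` style ID format, and the `extract_products` function are assumptions, and real custom resources would be written as Verity XML dictionary and grammar files instead.

```python
import re

# Hypothetical custom dictionary of product lines.
PRODUCT_LINES = ["Widget", "Gadget", "Sprocket"]

# Hypothetical custom rule: a product line taken from the dictionary,
# followed by an ID of two capital letters, a hyphen, and four digits.
PRODUCT_LINE = "|".join(re.escape(name) for name in PRODUCT_LINES)
PRODUCT_RULE = re.compile(
    rf"\b(?P<line>{PRODUCT_LINE})\s+(?P<id>[A-Z]{{2}}-\d{{4}})\b"
)


def extract_products(text):
    """Return a dict of components for each product mention found."""
    return [m.groupdict() for m in PRODUCT_RULE.finditer(text)]
```

Each match is returned as named components (`line` and `id`), in the same spirit as the Description/Components column of Table 3-1. A mention whose product line is missing from the word list is not matched at all.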

For complete information about entity extraction and all components of Verity Extractor, see the Verity Extractor Programming Guide.