The Indexing Process

Figure 3-1 summarizes the process of indexing a set of documents in a repository.

1. One by one, repository documents are opened (by K2 Spider or by some other indexing tool).

2. Each opened document’s format—such as Microsoft Word, Lotus 1-2-3, or HTML—is automatically identified.

3. K2 invokes the correct document filter for the format and extracts a word index from the document’s content.

4. K2 extracts document metadata (fields or attributes) for indexing into the collection’s document table. Several predefined metadata fields are supported, including author, date, and title. Custom fields are also supported. You define a custom field with a name and a data type, such as text, date, or number.

By extracting field values through gateways, K2 preserves the structure of structured and semi-structured documents so that users can target desired content more precisely.


Figure 3-1    Indexing documents into a collection



The collection’s document table also includes an identifying key (such as physical file system address or URL) for each document in the collection.
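The steps above can be sketched in a few lines of Python. This is only an illustrative toy and not K2's implementation or API: the names detect_format, filter_html, filter_text, and index_documents are hypothetical, format detection is reduced to checking the file-name extension, and the "repository" is an in-memory mapping from document key to raw content.

```python
import re
from collections import defaultdict

# Hypothetical format detection by file-name extension; a real indexer
# would also inspect the document content.
FORMATS = {".html": "html", ".htm": "html", ".txt": "text"}


def detect_format(key):
    for ext, fmt in FORMATS.items():
        if key.lower().endswith(ext):
            return fmt
    return "text"


def filter_html(raw):
    """Toy HTML filter: take the title as a field, strip tags from the text."""
    match = re.search(r"<title>(.*?)</title>", raw, re.IGNORECASE | re.DOTALL)
    fields = {"title": match.group(1).strip()} if match else {}
    text = re.sub(r"<[^>]+>", " ", raw)
    return text, fields


def filter_text(raw):
    """Toy plain-text filter: no metadata, content passes through unchanged."""
    return raw, {}


FILTERS = {"html": filter_html, "text": filter_text}


def index_documents(repository):
    """Build a toy collection from a mapping of document key -> raw content."""
    word_index = defaultdict(set)   # word -> keys of documents containing it
    document_table = {}             # document key -> metadata fields
    for key, raw in repository.items():           # step 1: open each document
        fmt = detect_format(key)                  # step 2: identify the format
        text, fields = FILTERS[fmt](raw)          # step 3: apply the matching filter
        for word in re.findall(r"\w+", text.lower()):
            word_index[word].add(key)
        document_table[key] = {"format": fmt, **fields}  # step 4: store fields
    return word_index, document_table


repository = {
    "docs/intro.html": "<html><title>Intro</title><body>Indexing basics</body></html>",
    "docs/notes.txt": "Notes on indexing collections",
}
word_index, document_table = index_documents(repository)
print(sorted(word_index["indexing"]))
print(document_table["docs/intro.html"])
```

In this sketch the document key (here a path) plays the role of the identifying key in the document table, and the word index maps each word back to the keys of the documents that contain it.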

A collection can optionally store data for document zones and fields. A zone is a named region of text in a document, such as the text enclosed by the <h2> or <body> tags in an HTML file. A collection that includes zones lets users restrict a search to these regions of a document.
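As a rough illustration of the zone idea, the following Python sketch collects the text found inside selected HTML tags and then searches a single zone. It uses the standard library's html.parser module and is not how K2 stores or queries zones; ZoneExtractor and search_zone are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class ZoneExtractor(HTMLParser):
    """Collect the text inside selected tags, treating each tag as a zone."""

    def __init__(self, zone_tags):
        super().__init__()
        self.zone_tags = set(zone_tags)
        self.open_zones = []                        # zones currently open
        self.zones = {tag: [] for tag in zone_tags}

    def handle_starttag(self, tag, attrs):
        if tag in self.zone_tags:
            self.open_zones.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_zones:
            # Close the innermost open zone with this name.
            idx = len(self.open_zones) - 1 - self.open_zones[::-1].index(tag)
            del self.open_zones[idx]

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text belongs to every zone that is open at this point.
            for tag in set(self.open_zones):
                self.zones[tag].append(text)


def search_zone(zones, zone, term):
    """Return True if the term occurs in the text of the named zone."""
    return any(term.lower() in chunk.lower() for chunk in zones.get(zone, []))


page = "<html><body><h2>Release notes</h2><p>Fixed the indexing bug.</p></body></html>"
extractor = ZoneExtractor(["h2", "body"])
extractor.feed(page)
extractor.close()

print(search_zone(extractor.zones, "h2", "indexing"))    # term is outside the <h2> zone
print(search_zone(extractor.zones, "body", "indexing"))  # term is inside the <body> zone
```

The same term matches in the body zone but not in the h2 zone, which is the kind of targeting that zone data makes possible.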

For more information about building collections, see the Verity Command-Line Indexing Reference, the Verity Collection Reference, and the Verity K2 Dashboard Administrator Guide.

Note   Verity also provides APIs for indexing collections from within an application. For more information, see the discussion of the Collection Indexing API in Client APIs.