K2 Spider

The Verity K2 Spider manages the indexing process that builds your collections. To build a collection, K2 Spider uses gateways to gain access to data and documents contained in repositories.

For example, if an enterprise wants its knowledge workers to be able to search Documentum files, the Documentum Gateway obtains access for K2 Spider, and then extracts properties, fields, data, and security schemas from the Documentum repository. K2 Spider then communicates with the gateway and gathers these attributes into a collection.

When a user searches for a Documentum file, K2 searches the collection and displays a list of results. If the user selects a document for viewing, K2 uses the Documentum Gateway to retrieve the selected file from the repository and display it.


Figure 3-2    Using K2 Spider for indexing



Distributed Indexing

K2 Spider distributes processes to controllers, crawlers, and indexers. Documents can be indexed in real time as they are created or modified, without disrupting current indexing jobs. K2 crawlers “walk” through information sources as indexes are built. The K2 Spider controller manages the crawlers, indexers, jobs, and workflow. As shown in Figure 3-2, the controller, crawlers, and indexers can be on separate machines.
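The division of labor described above can be sketched conceptually. The following is a minimal, illustrative model (not the K2 API): a controller coordinates a crawler, which walks information sources and discovers documents, and an indexer, which consumes the discovered documents into a collection. All names are assumptions for illustration.

```python
from queue import Queue
from threading import Thread

def crawler(sources, doc_queue):
    """Walk each information source and emit the documents found."""
    for source in sources:
        for doc in source:            # a real crawler would walk a repository
            doc_queue.put(doc)
    doc_queue.put(None)               # signal: crawl finished

def indexer(doc_queue, collection):
    """Consume discovered documents and index them into the collection."""
    while (doc := doc_queue.get()) is not None:
        collection.append(doc.lower())  # stand-in for real index processing

def controller(sources):
    """Coordinate one crawler and one indexer, much as the K2 Spider
    controller coordinates its (possibly remote) crawlers and indexers."""
    doc_queue, collection = Queue(), []
    workers = [Thread(target=crawler, args=(sources, doc_queue)),
               Thread(target=indexer, args=(doc_queue, collection))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return collection

print(controller([["Doc-A", "Doc-B"], ["Doc-C"]]))
```

Because crawling and indexing run as separate workers joined only by a queue, either side can be scaled out or placed on a different machine, which is the property the distributed architecture exploits.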

A Spider job is the process that actually builds a collection. With optional features, K2 Spider can also crawl secure Web servers that use SSL certificates, as well as Web servers that require cookies.

K2 Spider’s distributed architecture provides scalability and fault tolerance. For example, you can configure K2 Spider to index one repository but create two separate, identical collections on different servers. If one server slows down, K2 Spider continues to update the collection on the other server, ensuring up-to-date query results.
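The mirrored-collection idea can be illustrated with a short sketch. This is a conceptual model under assumed names (`Replica`, `index_everywhere`), not K2's implementation: one indexing stream writes to two replica collections, and a failure at one replica does not prevent the update from landing in the healthy one.

```python
class Replica:
    """Illustrative stand-in for a collection hosted on one server."""
    def __init__(self, name):
        self.name, self.docs, self.healthy = name, [], True

    def add(self, doc):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        self.docs.append(doc)

def index_everywhere(doc, replicas):
    """Try to index into every replica; survive individual failures."""
    for replica in replicas:
        try:
            replica.add(doc)
        except ConnectionError:
            pass  # this replica missed the update; the others carry on

a, b = Replica("server-a"), Replica("server-b")
index_everywhere("report.doc", [a, b])
b.healthy = False                      # server-b slows down or drops out
index_everywhere("memo.doc", [a, b])
print(a.docs)  # both documents
print(b.docs)  # only the first
```

Queries directed at the healthy replica continue to see current results, which is the fault-tolerance behavior the paragraph describes.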

Similarly, if K2 Spider detects a corrupted or unreadable document, it does not halt the indexing operation. Rather, it logs the corrupted document and continues to index the remaining documents, guaranteeing that they become part of the collection. You can then use the log information to locate and repair the corrupted documents in the repository.
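The skip-and-log behavior can be sketched as follows. The `parse` function and the document format are illustrative assumptions; the point is that an unreadable document is logged and skipped while indexing continues with the rest.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("spider")

def parse(doc):
    """Stand-in for document filtering; rejects corrupted input."""
    if doc.get("corrupt"):
        raise ValueError(f"unreadable document: {doc['path']}")
    return doc["path"]

def build_collection(docs):
    """Index every readable document; log and skip the corrupted ones."""
    collection, failed = [], []
    for doc in docs:
        try:
            collection.append(parse(doc))
        except ValueError as err:
            log.warning("skipping %s", err)   # recorded for later repair
            failed.append(doc["path"])
    return collection, failed

docs = [{"path": "a.doc"},
        {"path": "b.doc", "corrupt": True},
        {"path": "c.doc"}]
print(build_collection(docs))
```

The returned `failed` list plays the role of the log entries the administrator would use to locate and repair the corrupted documents in the repository.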

The K2 administrator controls K2 Spider’s indexing through the K2 Dashboard or through the command-line tools k2spider_srv and k2_spider_cli.

Continuous Indexing

K2 Spider continuously monitors the state of your collections. Its crawlers are persistent and regularly crawl documents based on the indexing frequency you set. They can also be configured to index documents as soon as they are created or modified, and can be integrated with content management system workflow processes. Collections can be updated continuously and automatically, even while searches are in progress. You do not have to repeatedly schedule which documents to index.
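The incremental re-crawl idea can be sketched briefly. This is a conceptual model, not K2's mechanism: on each pass, only documents created or modified since the previous pass are (re)indexed, so the collection stays current without re-scheduling full indexing jobs. The repository dictionary and timestamps are illustrative.

```python
def incremental_crawl(repository, indexed_at):
    """Return the docs whose mtime is newer than when they were last indexed."""
    changed = []
    for path, mtime in repository.items():
        if mtime > indexed_at.get(path, 0):
            changed.append(path)
            indexed_at[path] = mtime     # remember what was indexed, and when
    return changed

repo = {"a.doc": 100, "b.doc": 100}
seen = {}
print(incremental_crawl(repo, seen))   # first pass indexes everything
repo["b.doc"] = 200                    # b.doc is modified...
repo["c.doc"] = 200                    # ...and c.doc is created
print(incremental_crawl(repo, seen))   # next pass picks up only the changes
```

A persistent crawler running this check at the configured indexing frequency keeps the collection continuously up to date.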

Customizing K2 Spider

You can configure K2 Spider to suit your specific requirements through its published APIs. K2 Spider includes C and Java APIs that give you flexibility and control in building indexed collections. The APIs enable you to build your own administrative tool or user interface to control different aspects of the indexing process.

K2 Spider also provides style files (configuration files) for setting indexing preferences. Style files control different aspects of a collection, such as the summary data displayed in query results or the fields available for searching. For both optimal indexing performance and ease of administration, style files can be created and edited through the StyleSet Editor, a Windows-based interface. The StyleSet Editor lets you visually explore repositories, define the fields to extract for indexing, and map repository fields to collection fields. For more information on style files, see Manually Editing Style Files.