Data Sources

Depending on the nature of your organization’s information and the tasks you want to perform, different sources are appropriate for different kinds of information structures.

Sources for Collections

Information from repositories that passes through gateways and document filters is indexed into Verity collections, where it is made available for search by K2 applications. The best sources for searchable data are the repositories that hold your enterprise’s unstructured and semi-structured information. For example, all word-processing documents, spreadsheets, memos, emails, discussion threads, and presentation documents related to a given project can be indexed together into a single collection for that project.

It can also be beneficial to index more structured data that is searchable on its own (such as database information) into a collection, so that it is searchable from within a K2 application alongside all the other document types.
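As a rough illustration of why it helps to index mixed sources into one collection, the following Python sketch builds a toy inverted index over a few documents of different types and searches them together. It is not the Verity indexing pipeline or a K2 API; the `Document` class, the `build_collection` and `search` functions, and the sample data are all hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    source_type: str  # hypothetical label, e.g. "email", "memo", "database"
    text: str


def build_collection(documents):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc in documents:
        for word in re.findall(r"[a-z0-9]+", doc.text.lower()):
            index[word].add(doc.doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Illustrative sample data: a memo, an email, and a database record.
docs = [
    Document("memo-1", "memo", "Apollo project kickoff notes"),
    Document("mail-7", "email", "Re: Apollo budget review"),
    Document("db-42", "database", "Apollo project budget 2004 approved"),
]
collection = build_collection(docs)
print(search(collection, "apollo budget"))
```

Because the email and the database record sit in the same index, one query finds both, which is the benefit the paragraph above describes.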

Sources for Parametric Indexes

Parametric indexes support parametric selection, in which the user narrows the scope of a search by choosing values for certain parameters (for an automobile, for example, color, manufacturer, or price range).

In parametric selection, each parameter that a user can choose must be related to a collection field or XML element. Therefore, the kinds of data sources most useful for parametric selection are those that are semi-structured—that is, documents in which a significant amount of information can be turned into fields:

Database records


HTML catalog pages


XML files


Spreadsheet documents


XML files are also useful because you can build parametric indexes on them directly, without first indexing them into a collection.
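To make the idea of parametric selection concrete, here is a minimal Python sketch that turns the elements of a small XML catalog into fields and then narrows the records by chosen parameter values. It only illustrates the concept and does not use any Verity API; the catalog contents and the `load_records`, `select`, and `value_counts` functions are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical catalog: each child element of <car> acts as one field.
CATALOG_XML = """
<catalog>
  <car><make>Volvo</make><color>red</color><price>18000</price></car>
  <car><make>Volvo</make><color>blue</color><price>24000</price></car>
  <car><make>Fiat</make><color>red</color><price>12000</price></car>
</catalog>
"""


def load_records(xml_text):
    """Turn each <car> element into a dict of field name -> text value."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in car} for car in root.findall("car")]


def select(records, max_price=None, **params):
    """Keep records matching every chosen parameter and the optional price cap."""
    result = []
    for rec in records:
        if any(rec.get(field) != value for field, value in params.items()):
            continue
        if max_price is not None and int(rec["price"]) > max_price:
            continue
        result.append(rec)
    return result


def value_counts(records, field):
    """Count how many records carry each value of a field."""
    return Counter(rec[field] for rec in records)


records = load_records(CATALOG_XML)
red_cars = select(records, color="red")
print([r["make"] for r in red_cars])
print(value_counts(red_cars, "make"))
```

Each selectable parameter corresponds to one element name, which is why sources with many well-defined fields suit this kind of narrowing. The per-value counts show how an application might display the remaining choices after each selection.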

Sources for Entity Extraction

Entity extraction is the process of recognizing and capturing small-scale text structures (names, addresses, dates, numeric or monetary values, and the like) from unstructured text. The best sources for Verity Extractor are therefore likely to be entity-rich documents. Telephone directories, employee lists, activity logs, ledgers, account statements, and transaction records might be especially promising candidates for entity extraction.

On the other hand, the purpose might be not to extract large numbers of entities but to locate only entities of a specific kind within a large body of documents. In that case, entity extraction can be applied to any kind of readable document, structured or unstructured.
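The two uses described above, extracting every entity or locating only one kind, can be sketched with regular expressions in Python. This is a deliberately simplified stand-in and not how Verity Extractor works; the patterns, the `extract_entities` function, and the sample sentence are hypothetical.

```python
import re

# Deliberately simple patterns; real extractors handle far more variation.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_entities(text, kinds=None):
    """Return (kind, matched_text) pairs in order of appearance.

    If kinds is None, every known entity kind is extracted; otherwise
    only the listed kinds are searched for.
    """
    kinds = PATTERNS.keys() if kinds is None else kinds
    found = []
    for kind in kinds:
        for match in PATTERNS[kind].finditer(text):
            found.append((match.start(), kind, match.group()))
    return [(kind, value) for _, kind, value in sorted(found)]


statement = "On 2004-03-15, J. Smith (555-867-5309) paid $1,250.00."
print(extract_entities(statement))                   # all entity kinds
print(extract_entities(statement, kinds=["money"]))  # one specific kind
```

Passing `kinds` restricts the search to a single entity type, which mirrors the second scenario: scanning a large body of documents for one specific kind of entity.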

Sources for Recommendations

If a K2 application is set up to recommend documents, experts, or other entities to a user, those recommendations are based on historical searching behavior and feedback provided by that user and other similar users. The recommendations are dynamic; they come from the interaction between users and documents.

Therefore, the kinds of information sources you might want to use for recommendations include the following:

All your indexed enterprise data that is available for searching.


Published data authored by your users. If your organization uses a document management system, you can use Recommendation Engine APIs to import users’ documents and other information from it.


Emails, from which the Recommendation Engine can extract author and content information for updating user profiles.


Employee information.
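The idea that recommendations arise from the feedback of a user and of similar users can be illustrated with a small Python sketch. It uses plain Jaccard similarity over sets of positively rated documents, which is an assumption chosen for brevity and not the Recommendation Engine's actual method or API; the `jaccard` and `recommend` functions and the feedback data are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(feedback, user, top_n=3):
    """Rank documents the user has not rated by similar users' positive feedback."""
    seen = feedback.get(user, set())
    scores = defaultdict(float)
    for other, liked in feedback.items():
        if other == user:
            continue
        weight = jaccard(seen, liked)  # how similar the other user is
        for doc in liked - seen:
            scores[doc] += weight
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc for doc, score in ranked if score > 0][:top_n]


# Illustrative feedback: user -> documents that user rated positively.
feedback = {
    "ana": {"spec.doc", "budget.xls"},
    "ben": {"spec.doc", "budget.xls", "roadmap.ppt"},
    "cho": {"holiday.txt"},
}
print(recommend(feedback, "ana"))
```

The result changes as users record more feedback, which reflects the dynamic character of recommendations: they depend on ongoing interaction between users and documents and not on document content alone.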