Extracting Document Features

Document clustering is a technique for analyzing a set of documents and grouping them into clusters that address the same subjects. Document summarization is a technique for presenting a list of keywords or a short passage that summarizes the content of a document.

Clustering and content summarization rely on the Verity feature extraction technology, which operates during collection indexing. Feature extraction automatically discovers the subjects addressed in a document by performing vector analysis on nouns and noun phrases. Each document’s feature vector is then stored in the collection for use in clustering and summarization.
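The idea of comparing documents through feature vectors can be illustrated with a minimal sketch. It does not reproduce Verity's feature extraction, whose internals are not described here. It assumes that the nouns and noun phrases have already been extracted from each document, and it uses plain term counts with cosine similarity as a stand-in for the vector analysis. The function names (feature_vector, cosine_similarity, top_keywords) are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def feature_vector(terms):
    """Build a sparse term-count vector from already-extracted noun terms.

    Assumption: `terms` is a list of nouns / noun phrases produced by some
    earlier extraction step that is not shown here.
    """
    return Counter(term.lower() for term in terms)


def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two sparse vectors (0.0 to 1.0)."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_keywords(vector, n=3):
    """Return the n most frequent terms, with ties broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


if __name__ == "__main__":
    doc_a = feature_vector(["index", "collection", "index", "search"])
    doc_b = feature_vector(["collection", "index", "query"])
    doc_c = feature_vector(["invoice", "payment"])

    # Documents that share subjects score higher and would fall in one cluster.
    print(cosine_similarity(doc_a, doc_b))
    print(cosine_similarity(doc_a, doc_c))

    # The heaviest terms in a vector can serve as a keyword-style summary.
    print(top_keywords(doc_a, 2))
```

In this sketch, documents whose vectors point in similar directions are candidates for the same cluster, and the highest-weighted terms of a single vector act as a keyword summary. This is why one stored vector per document can support both features.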

Clustering and summarization are collection-level features. The administrator must therefore enable them at indexing time by specifying that feature vectors be generated when the collection is indexed.



For more information on clustering and summarization, see Clustering Results and Returning Document Summaries. For more information on feature vectors, see the Verity Developer’s Kit Programming Reference and the Verity Collection Reference.