Collection Requirements for Thematic Mapping

Thematic mapping works only on collections built with indexed nouns and noun phrases. For information on building this type of collection, see the Verity Collection Reference. In style.uni, the default filter for PDF documents is flt_kv, which allows nouns and noun phrases in PDF documents to be indexed.


Note   To ensure the quality of concept extraction, a collection used for thematic mapping should contain at least 2048 documents. A collection with fewer documents might still work.


Optimize the collection with a spanning word list. You can have the Business Console taxonomy module build the spanning word list when it finds, during certain operations, that the collection does not have one. You can also perform the optimization from the command line before starting Business Console:

mkvdk -collection <collection_name> -optimize spanword -locale <locale_name>

See the Verity Collection Reference for more details.
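When several collections need the same optimization, the command line above can be assembled by a small script. The following Python sketch only builds the argument list from the options shown in this section. The function name build_spanword_command and the values my_collection and my_locale are hypothetical placeholders, and the sketch assumes that mkvdk is on the search path.

```python
import shlex


def build_spanword_command(collection, locale, mkvdk="mkvdk"):
    """Return the argument list for the spanning word list optimization.

    Only the options shown in this section are used. The collection and
    locale values are supplied by the caller.
    """
    if not collection or not locale:
        raise ValueError("collection and locale are both required")
    return [
        mkvdk,
        "-collection", collection,
        "-optimize", "spanword",
        "-locale", locale,
    ]


# Hypothetical placeholder values; substitute your own collection and locale.
cmd = build_spanword_command("my_collection", "my_locale")
print(shlex.join(cmd))

# To run the command instead of printing it (assumes mkvdk is installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than as one string avoids quoting problems when a collection path contains spaces.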

To extract nouns and noun phrases properly, the Verity Business Console session locale must be the same as the collection’s locale.

System Requirements

The thematic mapping processes take place on the master host where Business Console resides. The processes require a substantial amount of memory and potentially a substantial amount of disk space. The amount of memory and disk space required depends on the:

Number of key terms extracted from the collections (see the next section for details)


Size of the collections (the number of documents in the collections)


When the amount of available memory is insufficient, a temporary cache file is created to store the intermediate data. On Windows, the cache file is created under the path specified by the %TMP% environment (user) variable, or by %TEMP% if %TMP% is not defined. By default, both variables point to %USERPROFILE%\Local Settings\Temp. If neither %TMP% nor %TEMP% is defined, the cache file is created in the Windows default directory, C:\Winnt. On UNIX, the cache file is created in the /tmp/ directory.
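The lookup order for the cache file location can be summarized in a few lines of Python. This is a minimal sketch of the rules described in this section, not the actual Business Console implementation. The function name resolve_cache_dir is hypothetical, and treating an empty variable as undefined is an assumption.

```python
def resolve_cache_dir(env, is_windows):
    """Return the directory in which the temporary cache file would be created.

    env is a mapping of environment variable names to values, for example
    os.environ. An empty value is treated as undefined (an assumption).
    """
    if not is_windows:
        # On UNIX the cache file always goes to /tmp/.
        return "/tmp/"
    # %TMP% takes precedence; %TEMP% is consulted only if %TMP% is not defined.
    for name in ("TMP", "TEMP"):
        value = env.get(name)
        if value:
            return value
    # Neither variable is defined: fall back to the Windows default directory.
    return r"C:\Winnt"


# Example with hypothetical paths: %TMP% wins when both variables are defined.
print(resolve_cache_dir({"TMP": r"D:\scratch", "TEMP": r"E:\temp"}, True))
```

Checking this location before starting a large job shows which drive needs the free space.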

The process fails if the disk drive on which the cache file is located runs out of space. When the collections to be processed contain a large number of documents, ensure that a substantial amount of disk space is available on that drive. The cache file is removed after the process finishes.
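Because the process fails when the cache drive fills up, it can help to check the free space before starting. The Python sketch below uses the standard shutil.disk_usage call. The function name has_free_space and the 1 GB threshold are hypothetical, since this section does not state how large the cache file can grow. The directory returned by tempfile.gettempdir() is only a stand-in, because Python's own lookup order may differ from the one Business Console uses.

```python
import shutil
import tempfile


def has_free_space(path, required_bytes):
    """Return True if the drive holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


# Stand-in for the cache directory; Python's lookup may differ from the
# order Business Console uses, so pass the real cache path if you know it.
cache_dir = tempfile.gettempdir()

# Hypothetical threshold: the actual cache size depends on the number of
# key terms and documents in the collections.
REQUIRED_BYTES = 1 * 1024 ** 3

if has_free_space(cache_dir, REQUIRED_BYTES):
    print("Enough free space in", cache_dir)
else:
    print("Low disk space in", cache_dir)
```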