Stop Word Lists for Supported Languages

A stop word list (stop list) is a list of common words that the IDOL Content component does not index. Words such as "the" or "a" occur too frequently to carry any significance, and Content does not require them to understand the concepts in a text. Using a stop list to remove these words can improve query results and performance, and save index space.
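
The effect of a stop list can be illustrated with a minimal Python sketch. This is not how Content is implemented; the function name index_terms and the sample words are hypothetical and serve only to show the idea of dropping stop words before indexing.

```python
# Illustrative only: a hypothetical sample stop list, not one shipped with Content.
STOP_WORDS = {"the", "a", "of", "and"}

def index_terms(text, stop_words=STOP_WORDS):
    """Return the terms that remain after stop words are removed."""
    # Lowercase the text, split on whitespace, and drop any stop word.
    return [word for word in text.lower().split() if word not in stop_words]

print(index_terms("The concept of a document"))
```

Only the terms that carry meaning remain, so fewer terms need to be stored and matched.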

Each language that Content supports needs a stop list; if the Content installer does not include a stop word list for the language that you want to use, you can create one.

You can use a standard text editor to create or edit a stop list. For example, you might want to add any words that occur in most or all of your documents and that you do not need to search for. Stop word lists are located in the Content IDOL/langfiles directory.

For all operations, Content recognizes words as stop words irrespective of the encoding they are in. For example, in Russian you can list a stop word in the UTF-8 encoding in the stop word list file, and Content recognizes it if it occurs in a document in KOI8 encoding.
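
One way to picture this behavior is to decode every term to a common representation before comparing it. The following Python sketch assumes that approach; it is not a description of the internals of Content, and the function name is_stop_word and the sample words are hypothetical.

```python
# Hypothetical stop words, as they would read after decoding a UTF-8 stop list.
STOP_WORDS = {"и", "в", "не"}

def is_stop_word(raw_term, document_encoding):
    """Decode a term from the document's encoding, then compare it."""
    term = raw_term.decode(document_encoding)
    return term in STOP_WORDS

# The same word arrives as different bytes in KOI8-R and UTF-8,
# but compares equal once both are decoded.
koi8_bytes = "не".encode("koi8_r")
utf8_bytes = "не".encode("utf-8")
print(koi8_bytes != utf8_bytes)
print(is_stop_word(koi8_bytes, "koi8_r"))
```

Because the comparison happens after decoding, the encoding of the stop list file and the encoding of the document do not need to match.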

For simplicity, HPE recommends that you type all the terms in the stop list in UTF-8 encoding. However, you can list the words in the stop list in any of the valid encodings for that language. For example, in Russian you can specify stop words in KOI8, UTF-8, ISO, and so on.

You can specify words in uppercase or lowercase, and you can separate them with spaces or new lines.
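
As a rough sketch of what these rules mean for anyone who reads a stop list with their own tooling, the following Python function splits on any whitespace and folds case. It assumes that uppercase and lowercase entries are treated as equivalent, and it ignores any sections in the file. The name parse_stop_list is hypothetical and is not part of Content.

```python
def parse_stop_list(text):
    """Return the set of stop words found in the text of a stop list."""
    # str.split() with no arguments splits on spaces and new lines alike;
    # casefold() makes "The" and "the" the same entry (an assumption).
    return {word.casefold() for word in text.split()}

sample = "The a\nOF\n  and the"
print(sorted(parse_stop_list(sample)))
```

Using a set also means that a word listed more than once is kept only once.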


If necessary, you can use different encodings in the same stop list file. You need to specify each word only once; that is, you do not need to specify the same word in several different encodings.

For each encoding that you want to use, create a section in your stop list file. Give the section the same name as the language type that you are using (for example, cyrillic_koi8, cyrillic_utf8).

For example: