Character Tokenization

You can tokenize text into character N-grams of a specified size. Set the NGram configuration parameter in your language configuration section to the number of characters to use in each N-gram.

NOTE:

Do not use the NGram parameter together with the SentenceBreaking configuration parameter.

For example, if you set NGram to 2, HPE IDOL Server tokenizes the word Hello as:

he el ll lo
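The overlapping split shown above can be reproduced with a short sketch. This is an illustration only, not HPE IDOL Server code: the function name char_ngrams is hypothetical, the lowercasing simply mirrors the example output, and returning words shorter than N unchanged is an assumption.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character N-grams of a word.

    Illustrative sketch only; not how IDOL Server is implemented.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    text = word.lower()  # mirrors the lowercase output in the example
    if len(text) <= n:
        # Assumption: a word no longer than N is kept as a single token.
        return [text] if text else []
    # Slide a window of n characters along the word, one step at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(" ".join(char_ngrams("Hello", 2)))  # he el ll lo
```

A word of length k produces k - N + 1 N-grams, so the five-letter word Hello gives four 2-grams.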

To apply N-gram tokenization only to multibyte strings, set NGramMultiByteOnly to True. For example:

[Japanese]
NGram=2
NGramMultiByteOnly=True

For this configuration, if you have a document that contains both English and Asian (multibyte) text, HPE IDOL Server tokenizes the Asian text according to the NGram parameter. It does not apply N-gram tokenization to the English text.
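The effect of NGramMultiByteOnly on mixed text can be sketched as follows. This is a hedged illustration, not the server's implementation: it assumes UTF-8 encoding, treats any character that encodes to more than one byte as multibyte, and splits words on whitespace. The function names are hypothetical.

```python
from itertools import groupby


def is_multibyte(ch: str) -> bool:
    # Assumption: "multibyte" means more than one byte in UTF-8.
    return len(ch.encode("utf-8")) > 1


def tokenize_multibyte_only(text: str, n: int) -> list[str]:
    """N-gram only the multibyte runs; keep single-byte runs whole."""
    tokens = []
    for word in text.split():
        # Split each word into runs of multibyte / single-byte characters.
        for multibyte, run in groupby(word, key=is_multibyte):
            chunk = "".join(run)
            if multibyte and len(chunk) > n:
                tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
            else:
                tokens.append(chunk)
    return tokens


print(tokenize_multibyte_only("IDOL 東京都庁", 2))
# ['IDOL', '東京', '京都', '都庁']
```

The English word stays a single token, while the four-character Japanese string is split into three overlapping 2-grams.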

To apply N-gram tokenization only to multibyte strings that consist of Chinese, Japanese, or Korean characters (and not to multibyte strings in other languages), set NGramOrientalOnly to True. For example:

[Japanese]
NGram=2
NGramOrientalOnly=True

For this configuration, if you have a document that contains multibyte text in both Japanese and Greek, HPE IDOL Server tokenizes the Japanese text according to the NGram parameter. It does not apply N-gram tokenization to the Greek multibyte text.
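The difference between Japanese and Greek text under NGramOrientalOnly can be sketched by testing code points instead of byte length. This is an illustration under stated assumptions: the Unicode ranges below (Hiragana and Katakana, CJK Unified Ideographs, Hangul Syllables) are a simplified stand-in, and the exact character set that HPE IDOL Server treats as Chinese, Japanese, or Korean is not given in this section. The names CJK_RANGES, is_cjk, and tokenize_cjk_only are hypothetical.

```python
from itertools import groupby

# Assumption: a simplified set of CJK code point ranges.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def tokenize_cjk_only(text: str, n: int) -> list[str]:
    """N-gram only the CJK runs; keep all other runs whole."""
    tokens = []
    for word in text.split():
        for cjk, run in groupby(word, key=is_cjk):
            chunk = "".join(run)
            if cjk and len(chunk) > n:
                tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
            else:
                tokens.append(chunk)
    return tokens


print(tokenize_cjk_only("東京都 Αθήνα", 2))
# ['東京', '京都', 'Αθήνα']
```

Greek characters are multibyte in UTF-8 but fall outside the CJK ranges, so the Greek word is kept whole.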

