You can tokenize characters into N-grams of a specified size. Set the NGram configuration parameter in your language configuration section to the number of characters to use in each N-gram group.
You must not use NGram with the SentenceBreaking configuration parameter.
For example, if you set NGram to 2, HPE IDOL Server tokenizes the word Hello as:
he el ll lo
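The sliding-window behavior in this example can be illustrated with a short Python sketch. This is not IDOL code: the function name char_ngrams is hypothetical, and the lowercasing step is an assumption made only so that the result matches the example output above.

```python
def char_ngrams(word, n):
    """Return the overlapping character N-grams of a word.

    Illustrative sketch only. Lowercasing is an assumption made to
    match the documented example output (he el ll lo).
    """
    word = word.lower()
    if len(word) <= n:
        # A word no longer than n yields a single group.
        return [word]
    # Slide a window of n characters along the word, one step at a time.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(" ".join(char_ngrams("Hello", 2)))  # he el ll lo
```

A word of length k produces k - n + 1 groups, so larger values of NGram produce fewer, longer terms.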
To tokenize only multibyte strings, set NGramMultiByteOnly to True. For example:
[Japanese]
NGram=2
NGramMultiByteOnly=True
For this configuration, if you have a document that contains both English and Asian (multibyte) text, HPE IDOL Server tokenizes the Asian text according to the NGram parameter. It does not tokenize the English text.
To tokenize only multibyte strings in Chinese, Japanese, and Korean characters (and ignore multibyte strings in other languages), set NGramOrientalOnly to True. For example:
[Japanese]
NGram=2
NGramOrientalOnly=True
For this configuration, if you have a document that contains multibyte text in both Japanese and Greek, HPE IDOL Server tokenizes the Japanese text according to the NGram parameter. It does not tokenize the Greek multibyte text.
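The difference between the two restricted modes can be sketched in Python. This is a rough illustration, not how HPE IDOL Server is implemented: all function names are hypothetical, "multibyte" is assumed to mean a character that needs more than one byte in UTF-8, and Chinese, Japanese, and Korean characters are approximated by their Unicode character names.

```python
import unicodedata
from itertools import groupby

# Assumption: CJK characters are recognized by Unicode name prefix.
CJK_NAME_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def is_multibyte(ch):
    """Assumed analogue of NGramMultiByteOnly: more than one UTF-8 byte."""
    return len(ch.encode("utf-8")) > 1


def is_cjk(ch):
    """Assumed analogue of NGramOrientalOnly: Chinese, Japanese, or Korean."""
    return unicodedata.name(ch, "").startswith(CJK_NAME_PREFIXES)


def ngrams(run, n):
    """Overlapping N-grams of a run; short runs are returned whole."""
    if len(run) <= n:
        return [run]
    return [run[i:i + n] for i in range(len(run) - n + 1)]


def tokenize(text, n, selects):
    """Split selected character runs into N-grams; keep other runs whole."""
    tokens = []
    for word in text.split():
        # Group consecutive characters by whether the predicate selects them.
        for selected, group in groupby(word, key=selects):
            run = "".join(group)
            tokens.extend(ngrams(run, n) if selected else [run])
    return tokens


text = "日本語 αβγ text"
print(tokenize(text, 2, is_multibyte))  # Japanese and Greek are split
print(tokenize(text, 2, is_cjk))        # only Japanese is split
```

With the multibyte predicate, both the Japanese and the Greek runs are split into two-character groups, while the English word is left whole. With the CJK predicate, only the Japanese run is split.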