Character Tokenization

You can tokenize characters into N-grams of a specified size. Set the NGram configuration parameter in your language configuration section to the number of characters to use in each N-gram group.

NOTE:

You cannot use the NGram parameter together with the SentenceBreaking configuration parameter.

For example, if you set NGram to 2, Content tokenizes the word Hello as:

he el ll lo
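The documented behavior can be sketched in a few lines of Python. The helper name `char_ngrams` is hypothetical; Content's internal tokenizer is not exposed, so this only mimics the output shown above.

```python
def char_ngrams(text, n):
    """Split a string into overlapping character N-grams.

    Sketch of the documented NGram behavior, not Content's own code.
    The example output above is lowercase, so the input is lowercased
    before the N-grams are produced.
    """
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(" ".join(char_ngrams("Hello", 2)))  # he el ll lo
```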

To tokenize only multibyte strings, set NGramMultiByteOnly to True.

[Japanese]
NGram=2
NGramMultiByteOnly=True

For this configuration, if you have a document that contains both English and Asian (multibyte) text, Content tokenizes the Asian text according to the NGram parameter. It does not tokenize the English text.
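A sketch of this behavior, assuming multibyte text corresponds to non-ASCII character runs: the function below (a hypothetical illustration, not Content's implementation) applies N-grams only to non-ASCII runs and keeps single-byte (English) words whole.

```python
import re

def tokenize_multibyte_only(text, n):
    """Mimic NGramMultiByteOnly=True: N-gram only runs of multibyte
    (non-ASCII) characters; leave single-byte words untokenized.
    Illustrative sketch of the documented behavior."""
    tokens = []
    # Split the text into alternating ASCII and non-ASCII runs.
    for run in re.findall(r"[\x00-\x7f]+|[^\x00-\x7f]+", text):
        if run.isascii():
            tokens.extend(run.split())  # whole English words
        else:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

print(tokenize_multibyte_only("Hello 東京都", 2))  # ['Hello', '東京', '京都']
```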

To tokenize only multibyte Chinese, Japanese, and Korean text (and ignore multibyte strings in other languages), set NGramOrientalOnly to True.

[Japanese] 
NGram=2
NGramOrientalOnly=True

For this configuration, if you have a document that contains multibyte text in both Japanese and Greek, Content tokenizes the Japanese text according to the NGram parameter. It does not tokenize the Greek multibyte text.
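The distinction from NGramMultiByteOnly can be sketched with a script check: Greek text is multibyte but not CJK, so it is left whole. The `is_cjk` test below uses a few common Unicode blocks as an approximation of what NGramOrientalOnly targets; it is a hypothetical illustration, not Content's actual logic.

```python
from itertools import groupby

def is_cjk(ch):
    """Rough CJK test over a few common Unicode blocks (approximation)."""
    cp = ord(ch)
    return (0x3040 <= cp <= 0x30ff or   # Hiragana / Katakana
            0x4e00 <= cp <= 0x9fff or   # CJK Unified Ideographs
            0xac00 <= cp <= 0xd7af)     # Hangul syllables

def tokenize_cjk_only(text, n):
    """Mimic NGramOrientalOnly=True: N-gram only CJK runs; other text,
    including multibyte Greek, is kept as whole words."""
    tokens = []
    for cjk, group in groupby(text, key=is_cjk):
        run = "".join(group)
        if cjk:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
        else:
            tokens.extend(run.split())  # Greek/English words stay whole
    return tokens

print(tokenize_cjk_only("αθήνα 東京都", 2))  # ['αθήνα', '東京', '京都']
```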
