You can tokenize characters into N-grams of a specified size. Set the NGram configuration parameter in your language configuration section to the number of characters to use in each N-gram group.
You must not use NGram with the SentenceBreaking configuration parameter.
For example, if you set NGram to 2, Content tokenizes the word Hello as:

he el ll lo
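The sliding-window behavior above can be sketched in a few lines of Python. This is an illustration of the general character N-gram technique, not Content's internal implementation:

```python
def ngrams(word, n=2):
    """Split a word into overlapping character N-grams of size n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# With n=2, "hello" yields the four overlapping pairs shown above.
print(ngrams("hello"))  # ['he', 'el', 'll', 'lo']
```

Setting NGram to 3 would instead produce hel, ell, and llo for the same word.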
To tokenize only multibyte strings, set NGramMultiByteOnly to True. For example:

[Japanese]
NGram=2
NGramMultiByteOnly=True
For this configuration, if you have a document that contains both English and Asian (multibyte) text, Content tokenizes the Asian text according to the NGram parameter. It does not tokenize the English text.
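A rough sketch of this selective behavior, treating any non-ASCII run as "multibyte" (an approximation for illustration only; the function name and the heuristic are not part of Content):

```python
import re

# Runs of non-ASCII characters stand in for "multibyte" text here.
MULTIBYTE_RUN = re.compile(r'[^\x00-\x7F]+')

def tokenize_multibyte_only(text, n=2):
    """Emit N-grams for multibyte runs; pass other text through unchanged."""
    tokens = []
    pos = 0
    for m in MULTIBYTE_RUN.finditer(text):
        if m.start() > pos:
            tokens.append(text[pos:m.start()].strip())
        run = m.group()
        # A run shorter than n is emitted whole.
        tokens.extend(run[i:i + n] for i in range(max(len(run) - n + 1, 1)))
        pos = m.end()
    if pos < len(text):
        tokens.append(text[pos:].strip())
    return [t for t in tokens if t]

print(tokenize_multibyte_only("Hello 日本語"))  # ['Hello', '日本', '本語']
```

The English word passes through untouched, while the Japanese run is broken into bigrams.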
To tokenize only multibyte strings in Chinese, Japanese, and Korean characters (and ignore multibyte strings in other languages), set NGramOrientalOnly to True. For example:

[Japanese]
NGram=2
NGramOrientalOnly=True
For this configuration, if you have a document that contains multibyte text in both Japanese and Greek, Content tokenizes the Japanese text according to the NGram parameter. It does not tokenize the Greek multibyte text.
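The distinction from NGramMultiByteOnly is the narrower character test: only Chinese, Japanese, and Korean scripts are N-grammed, so multibyte Greek passes through whole. A sketch using a few common Unicode CJK ranges (the ranges are illustrative, not Content's exact definition):

```python
import re

# Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables (illustrative ranges).
CJK_RUN = re.compile(r'[\u3040-\u30FF\u4E00-\u9FFF\uAC00-\uD7AF]+')

def tokenize_cjk_only(text, n=2):
    """Emit N-grams for CJK runs; pass all other text (including other
    multibyte scripts such as Greek) through unchanged."""
    tokens = []
    pos = 0
    for m in CJK_RUN.finditer(text):
        if m.start() > pos:
            tokens.append(text[pos:m.start()])
        run = m.group()
        # A run shorter than n is emitted whole.
        tokens.extend(run[i:i + n] for i in range(max(len(run) - n + 1, 1)))
        pos = m.end()
    if pos < len(text):
        tokens.append(text[pos:])
    return tokens

print(tokenize_cjk_only("日本語αβ"))  # ['日本', '本語', 'αβ']
```

Here the Japanese text becomes bigrams while the Greek characters survive as a single token.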