If your IDOL license includes automatic language detection, the IDOL Content component can automatically identify the language and encoding of a document when it is indexed. Content analyzes a certain amount of text in the document content fields (fields for which SourceType is set to True in the IDOL Content component configuration file).
Open the IDOL Content component configuration file in a text editor.
Find the [Server] section and add this setting:
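A minimal sketch of this step, assuming the setting is AutoDetectLanguagesAtIndex (the name is not given in the text above, so check it against the configuration reference for your IDOL version):

```
[Server]
// Assumed parameter name; enables language and encoding detection at index time.
AutoDetectLanguagesAtIndex=True
```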
(Optional) If you do not want to index documents whose language Content cannot recognize, set the corresponding parameter to True. Content might not recognize the language because the document contains no natural-language text, or because it contains too little text to determine the language.
By default, Content indexes the document using the default language type. It also logs a warning message in the index log, so that you can add an appropriate language type.
You can change the amount of text that Content analyzes to detect the language of a document. By default, Content uses only a few sentences. In some situations, increasing the amount of text to analyze can give more accurate results, such as when significant amounts of a minor second language are present.
By default, Content detects any 7-bit ASCII characters as UTF-8. If you instead want to group these documents with documents using 8-bit ASCII, disable the LangDetectUTF8 parameter by setting it to False.
Ensure that the encoding option you want is present in the language type configuration (see Define Language Types). If there are no compatible encodings configured for the detected language, IDOL assigns the default language type.
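As a sketch of this step, the fragment below assumes that LangDetectUTF8 belongs in the [LanguageTypes] section and that each language type lists its encodings with an Encodings parameter. Both are assumptions to confirm in the configuration reference. The english language type and its encoding names are illustrative only.

```
[LanguageTypes]
// Assumed section for this parameter; confirm in the configuration reference.
LangDetectUTF8=False
0=english

[english]
// Illustrative language type. The detected encoding must be listed here;
// otherwise Content assigns the default language type.
Encodings=ASCII:englishASCII,UTF8:englishUTF8
```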
Save and close the configuration file.
Restart the IDOL Content component for your changes to take effect.
If you enable automatic language detection and set up a field process that reads the language of a document from one of its fields, Content uses the field process rather than autodetection to determine the document language and encoding.
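For illustration, a field process of this kind might look like the fragment below. The process name SetLanguageFields is hypothetical. The LanguageFields property, its LanguageType flag, and the DRELANGUAGETYPE field are assumptions based on common IDOL configurations, so verify them against your own configuration file.

```
[FieldProcessing]
0=SetLanguageFields

[SetLanguageFields]
// Hypothetical process name; applies the LanguageFields property to the listed fields.
Property=LanguageFields
PropertyFieldCSVs=*/DRELANGUAGETYPE

[LanguageFields]
// Assumed property flag marking these fields as carrying the document language type.
LanguageType=True
```

With a process like this in place, Content reads the language type from the field and does not use autodetection for that document.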