IDOL Language Support Concepts
IDOL server uses probabilistic modeling and therefore does not require any form of language-dependent parsing, dictionaries or translation modules.
Treating words as abstract symbols of meaning allows Autonomy technology to derive understanding from the context in which the symbols occur, rather than from a rigid definition of grammar. Slang and other variations in language do not affect the analysis.
IDOL server can build up a statistical understanding of the patterns in any language. The more content IDOL server has on a particular subject (for example, legal terms, pharmaceutical developments, technology and so on), the more understanding it gains of that subject.
You can think of a new language as simply another type of information, for which IDOL server needs enough material to learn from. Therefore, it is possible to mix more than one language in IDOL server as long as you supply enough content in each language for IDOL server to build its understanding.
The choice of language does not compromise the accuracy of the concepts extracted by IDOL server. The underlying algorithm is the same regardless of the language used.
Autonomy's internationalization functionality enables:
automatic language detection. IDOL server can automatically detect the language and encoding of documents that it processes. This feature allows you to set up processes that IDOL server automatically applies to documents or document metadata if they are in a specific language. For example, if IDOL server identifies a document as Chinese, it automatically applies the appropriate preliminary linguistic tools.
NOTE If a document contains multiple languages, IDOL server determines which language predominates in the document, and processes the document according to the settings for that language.
cross-lingual systems. You can set up cross-lingual systems in IDOL server. This feature allows you to produce multilingual results for queries or to restrict results to documents in a specific language or encoding. For example, an English query may return information both in English and Spanish.
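The note above on mixed-language documents, and the option to restrict results to one language, can be illustrated with a minimal Python sketch. It is not IDOL server's implementation: the function names are hypothetical, the per-segment language labels are assumed to come from some upstream detector, and measuring "most" by character count is an assumption made for the example.

```python
from collections import Counter


def dominant_language(segments):
    """Return the language that accounts for the most characters.

    `segments` is a list of (language, text) pairs. The labels are assumed
    to come from an upstream detector, which this sketch does not implement.
    """
    totals = Counter()
    for language, text in segments:
        totals[language] += len(text)
    # most_common(1) yields [(language, total)] for the largest total
    return totals.most_common(1)[0][0]


def restrict_to_language(documents, language):
    """Keep only the documents whose dominant language matches `language`."""
    return [doc for doc in documents if dominant_language(doc) == language]


mixed = [
    ("english", "This report summarises the quarterly results in detail."),
    ("spanish", "Resumen breve."),
]
mostly_spanish = [
    ("spanish", "El informe resume los resultados del trimestre."),
    ("english", "Summary."),
]

assert dominant_language(mixed) == "english"
assert restrict_to_language([mixed, mostly_spanish], "spanish") == [mostly_spanish]
```

Each document is treated as a whole once its dominant language is known, which mirrors the behavior the note describes: one set of language settings per document.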
While Autonomy technology is language-independent, language-dependent features can improve the ability of IDOL server to match concepts regardless of how they appear in text. Autonomy therefore provides the following features:
stop word lists. Every language has words that carry little meaning on their own. In grammatical terms, these are normally prepositions, conjunctions, auxiliary verbs and so on (for example, words such as the, a, and, to in English). These words can be safely ignored when processing content.
Autonomy provides as standard a set of stop word lists for the most commonly used languages.
stemming. In many languages, groups of words share a common morphological root. Autonomy provides stemming algorithms that reduce words to this root form. This process allows you to match concepts regardless of the grammatical use of words. In English, for example, the words help, helpful, helping and helped can all be stripped to their stem help without significant loss of meaning.
Autonomy provides as standard a set of stemming algorithms for the most commonly used languages. IDOL server applies stemming after it discards stop words, both at index time (when content is stored in IDOL server) and at query time (before it matches the query text).
NOTE IDOL server also supports per-language use of a stemming file, which you can use in conjunction with the stemming algorithms to specify stems for individual words.
multiple encodings. Autonomy supports multiple encodings for languages such as Greek and Russian. You can use different encodings interchangeably, which means that it does not matter which encoding content in a language is supplied in. For example, it is possible to query in one recognized encoding for a language and receive results that are in other encodings.
transliteration schemes. Transliteration is the ability to represent letters that do not belong to the Latin alphabet, or words that contain accented letters, using the corresponding characters of another alphabet. This means that you do not need to be familiar with the accents and special characters of different languages.
canonicalization of characters. Some encodings have more than one way to represent a character. For example, the Japanese katakana script can have full-width or half-width characters. Regardless of its width, the character carries the same meaning.
The Autonomy software infrastructure uses canonicalization to ensure that it treats all character forms equally. It automatically converts characters to an internationally recognized canonical form.
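The features listed above can be illustrated together with a minimal Python sketch. None of it reflects how IDOL server implements them: the function names are hypothetical, the suffix list is a toy stand-in for a real stemming algorithm, STEM_OVERRIDES stands in for a per-language stemming file, and Unicode NFKC is only an assumed example of a canonical form.

```python
import re
import unicodedata

STOP_WORDS = {"the", "a", "and", "to"}   # the English examples from the text
SUFFIXES = ("ful", "ing", "ed")          # toy suffix list, not a real stemmer
STEM_OVERRIDES = {"better": "good"}      # stands in for a stemming file


def decode(raw, encoding):
    """Bring bytes in any known encoding into one internal representation."""
    return raw.decode(encoding)


def canonicalize(text):
    """Fold equivalent character forms together (NFKC is an assumed choice)."""
    return unicodedata.normalize("NFKC", text)


def transliterate(text):
    """Drop accents by decomposing letters and discarding combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(word):
    """Use an override if one exists, otherwise strip one known suffix."""
    if word in STEM_OVERRIDES:
        return STEM_OVERRIDES[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, tokenize, discard stop words, then stem what remains."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


# Stop words and stemming: different grammatical forms reduce to the same terms.
assert terms("Helping the user") == terms("helped a user") == ["help", "user"]

# Multiple encodings: the bytes differ, the decoded text does not.
russian = "Привет"
assert russian.encode("koi8_r") != russian.encode("cp1251")
assert decode(russian.encode("koi8_r"), "koi8_r") == decode(russian.encode("cp1251"), "cp1251")

# Transliteration: an unaccented query form can match accented content.
assert transliterate("café") == "cafe"

# Canonicalization: half-width katakana folds to the full-width form.
half_width = "\uff76\uff80\uff76\uff85"   # half-width "katakana"
full_width = "\u30ab\u30bf\u30ab\u30ca"   # full-width "katakana"
assert canonicalize(half_width) == full_width
```

Running the same functions over both stored content and query text is what makes the matching symmetric: any two inputs that reduce to the same terms are treated as the same concept, whatever encoding, character width, accents, or grammatical form they arrived in.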