looks_like_language

The looks_like_language function analyzes the content of a document to determine whether it contains text in the specified language. You can use this function to check whether Optical Character Recognition (OCR) has successfully extracted meaningful text from a file.

Syntax

looks_like_language( doc [, params] )

Arguments

Argument Description
doc (LuaDocument) The document that you want to check.
params (table) Optional. Additional named parameters that configure the analysis. The table maps parameter names (string) to parameter values. For information about the parameters that you can set, see the following table. For information about how to use named parameters, refer to the HPE Connector Framework Server Administration Guide.
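
The simplest form of the call passes only the document. The following is a minimal sketch, not a definitive implementation: has_readable_text is a hypothetical helper name, and the sketch assumes that looks_like_language is available as a global function inside a CFS Lua script.

```lua
-- Hypothetical helper: reports whether a document appears to contain
-- text in the default language.
function has_readable_text(doc)
    -- No params table is passed, so the documented defaults apply
    -- (language "ENGLISH", encoding "UTF-8", threshold 75).
    return looks_like_language(doc)
end
```

Because params is optional, this form is enough for a quick check that OCR produced English text.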

Named Parameters

Named Parameter Description Configuration Parameter
section (string) The name of a section in the CFS configuration file. If you set this parameter, any parameters that are not set in the params table are read from this section of the configuration file.
threshold (integer) The maximum quality score that a document can have and match the specified language. The quality score is an integer in the range 0-200, where lower numbers indicate higher quality. The default is 75. Threshold
term_file (string) The filename of the language termlist. TermFile
stop_list (string) The filename of the language stoplist. StopList
language (string) The name of the language against which the document is checked. The default is “ENGLISH”. Language
encoding (string) The expected type of encoding used to encode the document. The default is “UTF-8”. Encoding
minimum_valid_terms (integer) The minimum number of valid terms that a document must contain to match the specified language. If you do not specify a value, this check is ignored. MinimumValidTerms
minimum_percentage_terms_in_language (integer) The minimum percentage of terms that must be valid for a document to match the specified language. If you do not specify a value, this check is ignored. PercentageLanguageTerms
maximum_percentage_terms_not_in_language (integer) The maximum percentage of invalid terms. If the percentage of invalid terms exceeds this limit, the document does not match the specified language. If you do not specify a value, this check is ignored. PercentageNonLanguageTerms
punctuation (string) Characters that are considered to be punctuation characters. Specify all of the characters in a single string, for example ".£,%()$" Punctuation
maximum_percentage_punctuation (integer) The maximum percentage of characters in a document that can be punctuation characters. If the percentage of punctuation characters exceeds this limit, the document does not match the specified language. If you do not specify a value, this check is ignored. PercentagePunctuation
maximum_percentage_alphanumeric_terms (integer) The maximum percentage of terms in a document that can contain numbers. If the percentage of alphanumeric terms exceeds this limit, the document does not match the specified language. If you do not specify a value, this check is ignored. PercentageAlphaNumerical
classify_short_documents (boolean) Specifies how short documents are processed. False: short documents always pass the test. True: short documents are subject to the same criteria as other documents. ClassifyShortDocuments
quality_score_field (string) The name of a new field that is created in the document and populated with the document's numeric quality score (the score that is compared against threshold). If you do not specify a value, the field is not created. QualityScoreField
report_field (string) The name of a new field that is created in the document, and filled with a report on the language state of the document. If you do not specify a value, the field is not created. ReportField
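
The named parameters in the table above are passed as a Lua table keyed by parameter name. The following sketch shows one way to build such a table. The function names build_language_params and check_language, the field names LANGUAGE_QUALITY_SCORE and LANGUAGE_REPORT, and all of the numeric values are illustrative assumptions, not recommended settings.

```lua
-- Hypothetical helper: builds the params table for a stricter check.
-- Keys are the documented named parameters; values are examples only.
function build_language_params(language)
    return {
        language = language,                  -- for example "ENGLISH"
        threshold = 60,                       -- stricter than the default of 75
        minimum_valid_terms = 20,
        maximum_percentage_punctuation = 30,
        classify_short_documents = true,      -- apply the checks to short documents too
        quality_score_field = "LANGUAGE_QUALITY_SCORE", -- hypothetical field name
        report_field = "LANGUAGE_REPORT",               -- hypothetical field name
    }
end

-- Hypothetical helper: runs the check with the table built above.
function check_language(doc, language)
    return looks_like_language(doc, build_language_params(language))
end
```

To keep the settings in the CFS configuration file instead, the table could contain only section, for example { section = "LanguageCheck" }, where LanguageCheck is a hypothetical section name. Any named parameter that is omitted from the table is then read from that section.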

Returns

(Boolean) True if the document meets all of the conditions and is therefore likely to contain content in the specified language and encoding.
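
A script can branch on the Boolean result, for example to flag documents whose OCR output looks unusable. The following sketch rests on several assumptions: that CFS calls a global handler(document) function for a Lua task, that returning true from handler lets the document continue through the pipeline, and that LuaDocument provides an addField(name, value) method. The field names OCR_TEXT_OK and LANGUAGE_REPORT are hypothetical.

```lua
-- Assumed entry point for a CFS Lua task.
function handler(document)
    local params = {
        language = "ENGLISH",
        report_field = "LANGUAGE_REPORT", -- hypothetical field name
    }

    -- Record the outcome on the document rather than discarding it.
    if looks_like_language(document, params) then
        document:addField("OCR_TEXT_OK", "true")
    else
        document:addField("OCR_TEXT_OK", "false")
    end

    -- Assumption: returning true keeps the document in the pipeline.
    return true
end
```

Recording the result in a field, instead of rejecting the document, leaves later tasks free to decide how to treat documents that fail the check.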

