Customize Field Standardization

Field standardization modifies documents so that they have a consistent structure and consistent field names. You can use field standardization so that documents indexed into IDOL through different connectors use the same fields to store the same type of information. Field standardization only modifies fields that are specified in a dictionary, which is defined in XML format. A standard dictionary, named dictionary.xml, is supplied in the installation folder of every connector.

In most cases you should not need to modify the standard dictionary, but you can modify it to suit your requirements or create dictionaries for different purposes. By modifying the dictionary, you can configure the connector to apply rules that modify documents before they are ingested. For example, you can move fields, delete fields, or change the format of field values.

The following examples demonstrate how to perform some operations with field standardization.

The following rule renames the field Author to DOCUMENT_METADATA_AUTHOR_STRING. This rule applies to all components that run field standardization and applies to all documents.

<FieldStandardization>
    <Field name="Author">
        <Move name="DOCUMENT_METADATA_AUTHOR_STRING"/>
    </Field>
</FieldStandardization>

The following rule demonstrates how to use the Delete operation. This rule instructs CFS to remove the field KeyviewVersion from all documents (the Product element with the attribute key="ConnectorFrameWork" ensures that this rule is run only by CFS).

<FieldStandardization>
    <Product key="ConnectorFrameWork">
        <Field name="KeyviewVersion">
            <Delete/>
        </Field>
    </Product>
</FieldStandardization>

There are several ways to select fields to process using the Field element.

The following list gives each Field element attribute with its description and an example.
name

Select a field where the field name matches a fixed value.

Select the field MyField:

<Field name="MyField">
   ...
</Field>

Select the field Subfield, which is a subfield of MyField:

<Field name="MyField">
   <Field name="Subfield">
      ...
   </Field>
</Field>

path

Select a field whose path matches a fixed value.

Select the field Subfield, which is a subfield of MyField:

<Field path="MyField/Subfield">
   ...
</Field>

nameRegex

Select all fields at the current depth where the field name matches a regular expression.

In this case the field name must begin with the word File:

<Field nameRegex="File.*">
   ...
</Field>

pathRegex

Select all fields where the path of the field matches a regular expression.

This operation can be inefficient because every metadata field must be checked. If possible, select the fields to process another way.

This example selects all subfields of MyField.

<Field pathRegex="MyField/[^/]*">
   ...
</Field>

This approach would be more efficient:

<Field name="MyField">
   <Field nameRegex=".*">
      ...
   </Field>
</Field>

You can also limit the fields that are processed based on their value, by using one of the following:

The following list gives each Field element attribute with its description and an example.
matches

Process a field if its value matches a fixed value.

Process a field named MyField if its value is abc:

<Field name="MyField" matches="abc">
   ...
</Field>

matchesRegex

Process a field if its entire value matches a regular expression.

Process a field named MyField if its entire value consists of one or more digits:

<Field name="MyField" matchesRegex="\d+">
   ...
</Field>

containsRegex

Process a field if its value contains a match to a regular expression.

Process a field named MyField if its value contains three consecutive digits.

<Field name="MyField" containsRegex="\d{3}">
   ...
</Field>
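
As a minimal sketch of combining a name selector with a value restriction, the following rule would delete every field at the root of the document whose name begins with File and whose value is empty. It uses only attributes and operations described on this page. The File prefix and the empty-value test are illustrative choices and are not taken from the standard dictionary.

```xml
<FieldStandardization>
    <!-- nameRegex selects fields at the current depth (here, the document root). -->
    <!-- matches="" limits processing to fields that have no value. -->
    <Field nameRegex="File.*" matches="">
        <Delete/>
    </Field>
</FieldStandardization>
```

Because the selector uses nameRegex rather than pathRegex, only fields at the current depth are checked, which avoids testing every metadata field in the document.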

The following rule deletes every field or subfield where the name of the field or subfield begins with temp.

<FieldStandardization>
    <Field pathRegex="(.*/)?temp[^/]*">
        <Delete/>
    </Field>
</FieldStandardization>

The following rule instructs CFS to rename the field Author to DOCUMENT_METADATA_AUTHOR_STRING, but only when the document contains a field named DocumentType with the value 230 (the KeyView format code for a PDF file).

<FieldStandardization>
    <Product key="ConnectorFrameWork">
        <IfField name="DocumentType" matches="230"> <!-- PDF -->
            <Field name="Author">
                <Move name="DOCUMENT_METADATA_AUTHOR_STRING"/>
            </Field>
        </IfField>
    </Product>
</FieldStandardization>

TIP:

In this example, the IfField element is used to check the value of the DocumentType field. The IfField element does not change the current position in the document. If you used the Field element, field standardization would attempt to find an Author field that is a subfield of DocumentType, instead of finding the Author field at the root of the document.
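
For contrast, the following sketch shows the variant that the tip warns against, with Field used in place of IfField. It is included only to illustrate the difference in behavior and is not a recommended rule.

```xml
<FieldStandardization>
    <Product key="ConnectorFrameWork">
        <!-- Field changes the current position to DocumentType, -->
        <!-- so the nested selector looks for DocumentType/Author -->
        <!-- rather than the Author field at the root of the document. -->
        <Field name="DocumentType" matches="230">
            <Field name="Author">
                <Move name="DOCUMENT_METADATA_AUTHOR_STRING"/>
            </Field>
        </Field>
    </Product>
</FieldStandardization>
```

In a typical document, where Author is stored at the root, this variant would find nothing to rename.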

The following rules demonstrate how to use the ValueFormat operation to change the format of dates. The only format that you can convert date values into is the IDOL AUTNDATE format. The first rule transforms the value of a field named CreatedDate. The second rule transforms the value of an attribute named Created, on a field named Date.

<FieldStandardization>
    <Field name="CreatedDate">
        <ValueFormat type="autndate" format="YYYY-SHORTMONTH-DD HH:NN:SS"/>
    </Field>
    <Field name="Date">
        <Attribute name="Created">
            <ValueFormat type="autndate" format="YYYY-SHORTMONTH-DD HH:NN:SS"/>
        </Attribute>
    </Field>
</FieldStandardization>

As demonstrated by this example, you can select field attributes to process in a similar way to selecting fields.

You must select attributes using either a fixed name or a regular expression:

Select a field attribute by name
<Attribute name="MyAttribute">
Select attributes that match a regular expression
<Attribute nameRegex=".*">

You can then add a restriction to limit the attributes that are processed:

Process an attribute only if its value matches a fixed value
<Attribute name="MyAttribute" matches="abc">
Process an attribute only if its value matches a regular expression
<Attribute name="MyAttribute" matchesRegex=".*">
Process an attribute only if its value contains a match to a regular expression
<Attribute name="MyAttribute" containsRegex="\w+">
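
As a sketch of how these selectors fit inside a complete rule, the following example would delete the attribute MyAttribute from the field MyField, but only when the value of the attribute is abc. MyField, MyAttribute, and abc are placeholder names chosen for illustration.

```xml
<FieldStandardization>
    <Field name="MyField">
        <!-- The Attribute element is nested in the Field element that owns it. -->
        <Attribute name="MyAttribute" matches="abc">
            <!-- Delete acts on the selected attribute, not on MyField. -->
            <Delete/>
        </Attribute>
    </Field>
</FieldStandardization>
```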

The following rule moves all of the attributes of a field to subfields, if the parent field has no value. The id attribute on the first Field element assigns an identifier to the matching field so that later operations can refer to it. The GetName and GetValue operations save the name and value of a selected field or attribute (in this case an attribute) into variables (in this case $'name' and $'value'), which later operations can use. The AddField operation uses the variables to add a new field at the selected location (the field identified by id="parent"). The Delete operation then removes the original attribute.

<FieldStandardization>
    <Field pathRegex=".*" matches="" id="parent">
        <Attribute nameRegex=".*">
            <GetName var="name"/>
            <GetValue var="value"/>
            <Field fieldId="parent">
                <AddField name="$'name'" value="$'value'"/>
            </Field>
            <Delete/>
        </Attribute>
    </Field>
</FieldStandardization>

The following rule demonstrates how to move all of the subfields of UnwantedParentField to the root of the document, and then delete the field UnwantedParentField.

<FieldStandardization id="root">
    <Product key="MyConnector">
        <Field name="UnwantedParentField">
            <Field nameRegex=".*">
                <Move destId="root"/>
            </Field>
            <Delete/>
        </Field>
    </Product>
</FieldStandardization>
