To enable deduplication for all indexing jobs (in other words, to set deduplication by default for the DREADDATA actions), use the KillDuplicates configuration parameter in the [Server] section of the configuration file.
You must enable deduplication before you start indexing documents into the IDOL Content component.
You can use the KillDuplicatesChecksumField parameter to configure Content to reverse normal deduplication and retain the existing document instead of the incoming document, based on the value of a specified field in the incoming document.
You can use the KillDuplicatesPreserveFields parameter to configure one or more IDX fields that Content copies to a newer version of a duplicate document.
Open the IDOL Content component configuration file in a text editor.
In the [Server] section, set the
KillDuplicates parameter to
REFERENCEMATCHN, the names of the
ReferenceType fields to use to determine which documents are duplicates, or a combination of
ReferenceType field and a field that contains a document version number. For more information about these options, see Deduplication Options—KillDuplicates, or refer to the IDOL Server Reference.
You can identify fields that contain document references by setting up an appropriate field process. When you index a document that has the same value in the same
ReferenceType field as an existing document in Content, Content detects the duplicate. It deletes the existing document and replaces it with the new one.
Save and close the configuration file.
Restart the IDOL Content component for your changes to take effect.
You can now index documents into the IDOL Content component.
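As a minimal sketch of the steps above, the relevant part of the configuration file might look like the following. The value DREREFERENCE is only an illustration and assumes that DREREFERENCE is already set up as a ReferenceType field; substitute the KillDuplicates option that suits your data.

```
[Server]
// Illustrative value only: deduplicate on the DREREFERENCE ReferenceType field
KillDuplicates=DREREFERENCE
```

Because the setting applies to subsequent indexing jobs, it needs to be in place, and the component restarted, before you index any documents.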
You identify fields as ReferenceType fields through field processes. If you list multiple fields in the same PropertyFieldCSVs parameter where you list the FieldName for deduplication, Content uses all the fields to eliminate duplicate documents. For example:
[SetReferenceFields]
Property=Reference
PropertyFieldCSVs=*/DREREFERENCE,*/URL
In this example, Content uses both the DREREFERENCE field and the URL field to eliminate duplicate copies if you set KillDuplicates to DREREFERENCE.
If you want to define multiple ReferenceType fields but do not want to use them all for duplicate elimination, set up multiple field processes. For example:
[SetReferenceFields]
Property=Reference
PropertyFieldCSVs=*/DREREFERENCE

[SetMoreReferenceFields]
Property=Reference
PropertyFieldCSVs=*/URL
In this example, Content uses only the DREREFERENCE field to eliminate duplicate copies if you set KillDuplicates to DREREFERENCE. It does not use the URL field.
By default, when Content detects that a new document is a duplicate of an existing one, it replaces the existing document with the new one.
For either of these two KillDuplicates options, you can also use the KillDuplicatesChecksumField configuration parameter to specify a checksum field. Content then checks the value of this field in both documents. If the value is the same, Content keeps the existing document rather than replacing it with the new document.
This process prevents unnecessary updates. For example, when refetching a Web site, use KillDuplicatesChecksumField to configure Content to update the index for this site only if the site has changed.
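The following fragment is a hedged sketch of how the checksum check might be configured alongside deduplication. CONTENTHASH is a hypothetical field name, not one taken from this guide; use whichever field carries the checksum in your documents, and check the IDOL Server Reference for the exact field-name syntax that the parameter expects.

```
[Server]
// Illustrative value only: deduplicate on the DREREFERENCE ReferenceType field
KillDuplicates=DREREFERENCE
// CONTENTHASH is a hypothetical field name that holds a checksum of the page content
KillDuplicatesChecksumField=CONTENTHASH
```

With a configuration like this, a refetched page whose checksum field value matches the indexed copy would leave the existing document in place, and a page with a different value would replace it.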
KillDuplicatesChecksumField must be a
If there is a field that you want to keep in all versions of a document, regardless of whether it is later deleted or changed, you can use the KillDuplicatesPreserveFields configuration parameter.
To preserve fields, set KillDuplicatesPreserveFields to a comma-separated list of fields that you want to save.
When Content receives a duplicate document, it copies this field from the existing version of the document to the newer version when it performs
If there is more than one copy of the document in the Content index when a new version arrives, Content copies the preserve field from the existing duplicate with the highest document ID.
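As an illustration of the comma-separated list, a configuration might look like the sketch below. FIRSTSEEN and CATEGORY are hypothetical field names chosen for the example; the exact field-name syntax should be confirmed in the IDOL Server Reference.

```
[Server]
// Illustrative value only: deduplicate on the DREREFERENCE ReferenceType field
KillDuplicates=DREREFERENCE
// FIRSTSEEN and CATEGORY are hypothetical fields to carry over to the newer version
KillDuplicatesPreserveFields=FIRSTSEEN,CATEGORY
```

Under these assumptions, when a newer version of a document replaces an existing one, Content would copy the two listed fields from the existing version into the incoming one.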