Deduplication Constraints

Some constraints apply when you use deduplication together with other IDOL parameters.

Use the Combine Operation

The IDOL Content component cannot use the same ReferenceType field for deduplication as it uses for the Combine action parameter. The Combine operation occurs at query time and clashes with deduplication, which occurs at index time. If you intend both to deduplicate when indexing and to use the Combine action parameter when querying, you must set up separate ReferenceType fields for these processes.

Use Deduplication with DIH Reference-Based Indexing

You can enable the DIH for reference-based indexing. For more information, refer to the DIH Administration Guide.

If you index documents into IDOL with the DIH enabled for reference-based indexing, it might prevent deduplication of documents with different references. In this case, use only one of the following deduplication options:

Use Deduplication with DIH Field-Based Indexing

You can use field-based indexing in the DIH to ensure correct deduplication in a distributed system. For more information on configuring the DIH for field-based indexing, refer to the DIH Administration Guide.

If you set KeepExisting to False, or use KillDuplicatesDB options, it might prevent correct deduplication. To deduplicate correctly, you can distribute data by the DeDupeHash field, which contains an MD5 hash of the document. In this way, the DIH sends all duplicates to the same child server. Setting KillDuplicates to DeDupeHash in the index action then ensures accurate deduplication.
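
The following Python sketch illustrates why distributing by a content hash keeps duplicates together. It is a conceptual model only, not the DIH's actual distribution algorithm: the function names (dedupe_hash, pick_child), the sample documents, and the modulo-based mapping are all assumptions made for illustration.

```python
import hashlib


def dedupe_hash(content: str) -> str:
    """Return the hex MD5 digest of the content (a stand-in for DeDupeHash)."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def pick_child(hash_value: str, child_count: int) -> int:
    """Map a hash to a child server index (hypothetical modulo scheme).

    Equal hashes always map to the same index, which is the only
    property this sketch relies on.
    """
    return int(hash_value, 16) % child_count


# Hypothetical documents: ref-A and ref-B have identical content
# but different references.
docs = {
    "ref-A": "Quarterly report, final version.",
    "ref-B": "Quarterly report, final version.",
    "ref-C": "Meeting notes from Tuesday.",
}

CHILD_COUNT = 3  # assumed number of child servers

routing = {
    ref: pick_child(dedupe_hash(text), CHILD_COUNT)
    for ref, text in docs.items()
}

for ref, child in routing.items():
    print(f"{ref} -> child {child}")
```

Because ref-A and ref-B produce the same hash, they are always routed to the same child, so that child sees both copies and can remove the duplicate. Distributing by reference gives no such guarantee, because the two copies have different references.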

To use a field for deduplication, you must configure it as a ReferenceType field in the IDOL Content component configuration file. You do not need to configure it as ReferenceType in the DIH configuration file.

Deduplication of content occurs for all reference fields specified in a single PropertyFieldCSVs list in the IDOL Content component configuration file. To deduplicate by the DeDupeHash field only, and not by the DREREFERENCE field as well, you must set these reference fields in separate field processing sections in the IDOL Content component configuration file.