Locate Duplicate Documents

You can locate duplicate documents in the data index after indexing has taken place by using the DREDUPLICATE index action. This action locates duplicates in a specified subset of the content, and then deletes them, tags a field in them, or moves them to another database.

http://IDOLhost:indexPort/DREDUPLICATE?ReferenceField=Field&DuplicateAction=Action

where:

IDOLhost is the IP address or host name of the machine on which HPE IDOL Server is installed.
indexPort is the HPE IDOL Server index port (specified as IndexPort in the [Server] section of the HPE IDOL Server configuration file).
Field is a ReferenceType field used as the initial determination of whether two documents are a match.
Action is the action to perform on a duplicate. The following options are available:

  • Delete. Deletes all duplicate documents.

  • Database. Moves all duplicate documents to a database. If you select the Database action, you must specify the database in the Database parameter.

  • Tag. Tags a specified field in the duplicate document. You must specify the field in the TagField index parameter. You can also specify a value to tag the field with by using the TagValue parameter. If you do not specify a TagValue, the field is tagged with the value 1.

For example:

http://MyHost:20001/DREDUPLICATE?ReferenceField=DOCUMENT/DREREFERENCE&DuplicateAction=Database&Database=Duplicates

This action is sent to index port 20001 of the HPE IDOL Server that is located on the machine with the host name MyHost. HPE IDOL Server uses the DREREFERENCE field to identify duplicate documents, and moves them to the Duplicates database.

http://MyHost:20001/DREDUPLICATE?ReferenceField=DOCUMENT/DREREFERENCE&DuplicateAction=Tag&TagField=DOCUMENT/DRETITLE&TagValue=Duplicate

In this example, HPE IDOL Server uses the DREREFERENCE field to identify the duplicate documents, and then sets the DRETITLE field of each duplicate document to the value Duplicate.
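
The parameter rules and the two example URLs above can be sketched in a short script. This is a minimal illustration only, not part of HPE IDOL Server: the helper name build_dreduplicate_url and its arguments are hypothetical, and it assumes that the action names are passed exactly as written in the list above. It only builds the URL string and does not send the action to a server.

```python
from urllib.parse import urlencode

# Action names as listed in this section (assumed to be passed verbatim).
VALID_ACTIONS = {"Delete", "Database", "Tag"}


def build_dreduplicate_url(host, index_port, reference_field, action,
                           database=None, tag_field=None, tag_value=None):
    """Build a DREDUPLICATE index action URL (hypothetical helper)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"Unknown DuplicateAction: {action!r}")

    params = [("ReferenceField", reference_field),
              ("DuplicateAction", action)]

    if action == "Database":
        # The Database action needs a target database.
        if not database:
            raise ValueError("DuplicateAction=Database requires Database")
        params.append(("Database", database))
    elif action == "Tag":
        # The Tag action needs a field to tag; TagValue is optional
        # (the section states that the field is tagged with 1 if omitted).
        if not tag_field:
            raise ValueError("DuplicateAction=Tag requires TagField")
        params.append(("TagField", tag_field))
        if tag_value is not None:
            params.append(("TagValue", tag_value))

    # safe="/" keeps field paths such as DOCUMENT/DREREFERENCE readable.
    query = urlencode(params, safe="/")
    return f"http://{host}:{index_port}/DREDUPLICATE?{query}"


move_url = build_dreduplicate_url(
    "MyHost", 20001, "DOCUMENT/DREREFERENCE", "Database",
    database="Duplicates")
tag_url = build_dreduplicate_url(
    "MyHost", 20001, "DOCUMENT/DREREFERENCE", "Tag",
    tag_field="DOCUMENT/DRETITLE", tag_value="Duplicate")
print(move_url)
print(tag_url)
```

With the arguments shown, the script prints the same two URLs as the examples above. Checking the required parameters before building the URL catches a missing Database or TagField value before the action is sent.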

To prevent HPE IDOL Server from indexing duplicate documents, use the KillDuplicates parameter with the DREADD and DREADDDATA index actions.

For details on the other parameters that are available for the DREDUPLICATE index action, refer to the HPE IDOL Server Reference.

