Change the Number and Size of Clusters

There are two main stages to the clustering process:

Build Seeds

HPE IDOL Server builds seeds when you send the ClusterSnapshot action. HPE IDOL Server takes a sample of the documents that it stores, and tries to associate individual documents with each other, based on the similarity of the concepts that the documents contain. Each group produced at this stage, containing a sample document and similar documents, is a seed.

HPE IDOL Server stops trying to build a seed when the seed meets the requirements that SeedSize specifies or when there are no more documents that meet the similarity requirement that SeedBindLevel specifies (whichever condition is reached first). HPE IDOL Server discards any seeds that do not reach the required size.
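The stopping rule above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HPE IDOL Server's actual algorithm: the function `build_seed`, the 0-100 similarity scores, and the choice to count the sample document toward the seed size are all hypothetical.

```python
from typing import Dict, List, Optional


def build_seed(sample_doc: str,
               similarities: Dict[str, float],
               seed_size: int,
               seed_bind_level: float) -> Optional[List[str]]:
    """Toy model of building one seed around a sample document.

    similarities maps each candidate document to a hypothetical
    similarity score (0-100) against sample_doc.
    """
    seed = [sample_doc]
    # Consider the most similar candidates first.
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    for doc, score in ranked:
        if len(seed) >= seed_size:
            break  # the seed has reached the size that SeedSize requires
        if score < seed_bind_level:
            break  # no remaining document meets the SeedBindLevel requirement
        seed.append(doc)
    # Seeds that do not reach the required size are discarded.
    return seed if len(seed) >= seed_size else None


scores = {"b": 90, "c": 80, "d": 40}
print(build_seed("a", scores, seed_size=3, seed_bind_level=70))
print(build_seed("a", scores, seed_size=4, seed_bind_level=70))
```

In this model, the first call produces a seed of three documents, while the second is discarded because only two candidates meet the similarity requirement.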

The number of clusters that you specify with NumClusters affects the number of sample documents that HPE IDOL Server tries to create seeds from. You can adjust the relationship between the number you specify here and the size of the sample used by changing the value of StartingSuggestOverrideFactor.

Group Seeds into Clusters

HPE IDOL Server groups seeds into clusters when you send the ClusterSGDataGen or ClusterCluster actions. HPE IDOL Server tries to create clusters by grouping seeds together. The grouping is based on the similarity of the concepts that the seeds or clusters contain.

Clustering is complete when one of the following conditions is met:

HPE IDOL Server discards clusters that do not meet the quality requirement set by BindLevel or the size requirement set by MinClusterDocs.
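As a rough illustration of this second stage, the following sketch merges seeds greedily while the most similar pair meets a threshold, then drops clusters that are too small. It is a simplification under stated assumptions and not the server's implementation: `group_seeds`, the Jaccard-style `similarity` score, and the sample seeds are hypothetical, and BindLevel is modeled here only as a merge threshold.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Cluster = Tuple[FrozenSet[str], FrozenSet[str]]  # (documents, concepts)


def similarity(a: Cluster, b: Cluster) -> float:
    """Hypothetical 0-100 score: percentage overlap of the concept sets."""
    union = a[1] | b[1]
    return 100.0 * len(a[1] & b[1]) / len(union) if union else 0.0


def group_seeds(seeds: List[Cluster],
                bind_level: float,
                min_cluster_docs: int) -> List[Cluster]:
    clusters = list(seeds)
    while len(clusters) > 1:
        # Find the most similar pair among the current clusters.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        if similarity(clusters[i], clusters[j]) < bind_level:
            break  # nothing left is similar enough to merge
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    # Discard clusters that contain too few documents.
    return [c for c in clusters if len(c[0]) >= min_cluster_docs]


seeds = [
    (frozenset({"d1", "d2"}), frozenset({"sport", "football", "goal"})),
    (frozenset({"d3", "d4"}), frozenset({"sport", "football", "league"})),
    (frozenset({"d5", "d6"}), frozenset({"sport", "tennis", "serve"})),
]
for level in (10, 40, 60):
    print(level, len(group_seeds(seeds, bind_level=level, min_cluster_docs=2)))
```

With these sample seeds, raising the threshold stops merging earlier, so the toy model returns more, smaller clusters. Raising `min_cluster_docs` above the size of a cluster removes it from the result.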

For details of the clustering actions, and the settings you can make to generate the clusters from your data, refer to the HPE IDOL Server Reference.

Configuration Parameters

The ideal values for the parameters that affect clustering depend on the nature and amount of data in your HPE IDOL Server. You can use the SentientClustering parameter for the ClusterSnapshot action to automatically determine the correct values for SeedSize and SeedBindLevel.
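As a minimal sketch, a ClusterSnapshot request with SentientClustering enabled might be assembled as follows. The helper `build_action_url`, the host `localhost`, and the port `9000` are placeholders, and the `action=...&Parameter=value` URL form is an assumption about how your server accepts actions. A real request probably needs further parameters, which are omitted here; check the HPE IDOL Server Reference for the exact syntax.

```python
from urllib.parse import urlencode


def build_action_url(host: str, port: int, action: str, **params: str) -> str:
    """Build an action URL (hypothetical helper; host and port are placeholders)."""
    query = urlencode({"action": action, **params})
    return f"http://{host}:{port}/{query}"


# Ask the server to choose SeedSize and SeedBindLevel automatically.
url = build_action_url("localhost", 9000, "ClusterSnapshot",
                       SentientClustering="true")
print(url)
```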

This section makes general recommendations about how to manually alter these parameters according to your data. Parameters are closely interdependent, so make these changes in combination with each other (rather than just changing one of the settings). Change values in small steps.

Although you can make many changes to clustering, the number and size of clusters that HPE IDOL Server can identify depend ultimately on the data that it contains. The following sections describe how to adjust clustering in specific situations.

Cluster a Small Amount of Data

If your HPE IDOL Server has a small amount of data, it is likely to identify fewer clusters, because it is less likely that your data contains a lot of similar documents for several different topics. You can edit the following parameters to change clustering in this situation.

NOTE:

Ideally, your HPE IDOL Server should contain at least 500 documents.

SeedSize

Decrease SeedSize (by three to four points at a time). This option reduces the size that seeds must reach, which means that more seeds are likely to be successfully created.

MinClusterDocs

Decrease MinClusterDocs so that clusters that contain fewer documents are not discarded.

StartingSuggestOverrideFactor

Increase StartingSuggestOverrideFactor (by one or two points only). This increases the number of sample documents from which HPE IDOL Server creates seeds, which in some cases increases the possibility of finding clusters in the data.

SeedBindLevel

Decrease SeedBindLevel (by one point at a time) to reduce the similarity threshold for clusters. Do not change this value until you try changing SeedSize, because lowering SeedBindLevel is more likely to allow less-relevant documents into clusters.
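The following sketch shows one way to plan such small, combined steps for a small data set. It is illustrative only: `small_data_steps` is a hypothetical helper, the starting values are invented and are not server defaults, and the one-point step for MinClusterDocs is an assumption, because this section does not give a step size for that parameter.

```python
from typing import Dict, List


def small_data_steps(settings: Dict[str, int], rounds: int) -> List[Dict[str, int]]:
    """Return successive candidate settings, each a small step from the last."""
    plan = []
    current = dict(settings)  # leave the caller's settings untouched
    for _ in range(rounds):
        current = dict(current)
        current["SeedSize"] = max(1, current["SeedSize"] - 3)
        current["MinClusterDocs"] = max(1, current["MinClusterDocs"] - 1)  # assumed step
        current["StartingSuggestOverrideFactor"] += 1
        plan.append(current)
    # Lower SeedBindLevel by one point only after trying the SeedSize steps.
    last = dict(current)
    last["SeedBindLevel"] -= 1
    plan.append(last)
    return plan


# Hypothetical starting values, not defaults.
start = {"SeedSize": 15, "MinClusterDocs": 5,
         "StartingSuggestOverrideFactor": 2, "SeedBindLevel": 30}
for candidate in small_data_steps(start, rounds=2):
    print(candidate)
```

Run the clustering actions again after each candidate and stop as soon as the results are acceptable, rather than applying the whole plan at once.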

Cluster a Large Amount of Data

If your HPE IDOL Server has a large amount of data, you probably do not need to edit any clustering parameters, because this is the situation in which clustering is most successful. In some cases (for example, if your HPE IDOL Server contains more than a million documents), it can be beneficial to alter the following parameter.

StartingSuggestOverrideFactor

Increase the value of this parameter to increase the number of sample documents from which HPE IDOL Server creates seeds. This is sometimes necessary to allow a broader section of the data content to be represented by the clusters that are created.

Cluster Very Similar Data

If the documents in your HPE IDOL Server contain highly similar concepts, HPE IDOL Server might identify a small number of large clusters. For example, if your HPE IDOL Server contains mostly documents about sports, then you might get one large sports cluster. This situation is a realistic characterization of the data in your HPE IDOL Server, but in many circumstances is not useful. You can edit the following parameters to generate smaller, more specific clusters (for example, breaking sports into football, tennis, golf, and so on).

SeedBindLevel

Increase SeedBindLevel to require greater similarity between the documents that form a seed, which can reduce the breadth of topics covered by the concepts in the seed documents.

NOTE:

Increase SeedBindLevel one point at a time. Increasing by too much can result in seeds being discarded because they do not contain enough documents.

BindLevel

Increase BindLevel to require greater similarity between the concepts in seeds or clusters that merge to create a cluster. This change can decrease the size of clusters, as well as increase the number of clusters identified, because merging seeds and clusters together stops at an earlier stage.

Cluster Very Different Data

If the documents in your HPE IDOL Server contain a wide variety of concepts, there might not be enough similar documents for HPE IDOL Server to create seeds or clusters that characterize the data that it stores. You can lower the similarity requirement with the following parameters.

SeedBindLevel

Decrease SeedBindLevel to reduce the similarity requirement between the documents that form a seed, which can increase the breadth of topics covered by the concepts in the seed documents.

NOTE:

Decrease SeedBindLevel one point at a time. Decreasing by too much can result in seeds and clusters that contain documents that are less relevant, because the similarity requirement is too low.

BindLevel

Decrease BindLevel to reduce the similarity requirement between the concepts in seeds or clusters that merge to create a cluster. This change can increase the size of clusters, as well as increase the number of clusters identified (because fewer are discarded for not meeting the quality requirement).

Change the Data View

Even when HPE IDOL Server identifies clusters that characterize your data successfully, you might want to change the view of the data that clustering creates. The following parameters enable you to change the data view that clusters generate.

NumClusters

Increase NumClusters to obtain a more low-level view of your data by identifying more clusters. Decrease NumClusters to obtain a more high-level view by identifying fewer clusters.

MinClusterDocs

Decrease MinClusterDocs to reduce the number of clusters that HPE IDOL Server discards. This option allows you to identify smaller clusters.

Increase MinClusterDocs to increase the number of clusters that HPE IDOL Server discards, so that only larger clusters are kept.

BindLevel

Decrease BindLevel to reduce the similarity requirement between the concepts in seeds or clusters that merge to create a cluster. This option can increase the size of clusters, as well as increase the number of clusters identified (because HPE IDOL Server discards fewer clusters for not meeting the quality requirement).

Increase BindLevel to increase the similarity requirement between the concepts in seeds or clusters that merge to create a cluster. This option can decrease the size of clusters, as well as increase the number of clusters identified, because merging seeds and clusters together stops at an earlier stage.

 

