MustHaveCheck
 
Type

Long

Default

0

Allowed range

Minimum: 0

Maximum: 4045

Recommended range

Minimum: 0

Maximum: 4045

Required

no

Configuration section

[Default] and [individual_spider]

Description

A bitwise mask number that determines which pages to discard. The number specifies which parts of a page the connector checks for the strings defined in the parameter MustHaveCSVs. If the checked parts do not contain any of these strings, the connector discards the page. You can create the number by adding together the following values as appropriate:

URL: 1
If you enter 1, the connector determines whether the URL of a page contains any of the strings specified in the parameter MustHaveCSVs. If the URL does not contain any of these strings, the connector discards the page.

Page header: 4
If you enter 4, the connector determines whether the HTML <HEAD> tag of a page contains any of the strings specified in the parameter MustHaveCSVs. If the tag does not contain any of these strings, the connector discards the page.

Page content: 8
If you enter 8, the connector determines whether the content of a page contains any of the strings specified in the parameter MustHaveCSVs. If the content does not contain any of these strings, the connector discards the page.

Case insensitive: 64
If you add 64 to the MustHaveCheck value, the connector ignores case when it checks the specified parts of a page for the strings specified in the parameter MustHaveCSVs. If the connector does not find a case-insensitive match for any of the MustHaveCSVs strings, it discards the page.

Note: If you specify 64, you must also specify another value to indicate which part of the page the connector should check for the MustHaveCSVs strings.

Before download: 128
If you add 128 to the MustHaveCheck value, the connector determines whether a page contains any of the strings specified in the parameter MustHaveCSVs before it downloads it. If a page does not contain any of the MustHaveCSVs strings, it is not downloaded.

Note: If you specify 128, you must also specify another value to indicate which part of the page the connector should check for the MustHaveCSVs strings.

Spider check cache URL: 256
If you enter 256, the connector checks previously retrieved URLs from the spider structure cache whenever MustHaveCSVs is modified to determine whether the URLs have changed. If a URL fails this check, it is deleted.

Note: If you specify 256, you must also specify 1 (URL).

Valid site structure: 512
If you enter 512, the connector rechecks the MustHaveCSVs values for the site to ensure the site is still valid before it updates it. If you do not include this setting, then changes to these values are never checked. If the site is not valid, it is not downloaded.

Spider strip content: 1024
If you enter 1024, the connector unescapes any HTML entities in downloaded pages. This can affect other functionality. For example, if the date format of a page contains HTML entities, these are removed before a date check is performed.

Spider check content type: 2048
If you enter 2048, the connector checks the content type of the page for the strings specified in MustHaveCSVs before downloading it. If the content type does not contain any of the MustHaveCSVs strings, it is not downloaded.

If you enter 0, the connector does not check for MustHaveCSVs.
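
As an illustration of how the values combine, the following Python sketch adds flags together with a bitwise OR and checks the dependencies described in the notes above. The constant names, the helper validate_must_have_check, and the reading of "part of the page" as the URL, page header, and page content values are assumptions made for this example. They are not part of the connector.

```python
# Flag values copied from the list above; the constant names are illustrative.
URL = 1
PAGE_HEADER = 4
PAGE_CONTENT = 8
CASE_INSENSITIVE = 64
BEFORE_DOWNLOAD = 128
SPIDER_CHECK_CACHE_URL = 256
VALID_SITE_STRUCTURE = 512
SPIDER_STRIP_CONTENT = 1024
SPIDER_CHECK_CONTENT_TYPE = 2048

# Assumption: "which part of the page" in the notes means one of these three flags.
PAGE_PARTS = URL | PAGE_HEADER | PAGE_CONTENT


def validate_must_have_check(mask: int) -> list[str]:
    """Return a list of problems found in a MustHaveCheck value (empty if none)."""
    problems = []
    if not 0 <= mask <= 4045:
        problems.append("value is outside the allowed range 0-4045")
    if mask & CASE_INSENSITIVE and not mask & PAGE_PARTS:
        problems.append("64 needs a page part (1, 4, or 8)")
    if mask & BEFORE_DOWNLOAD and not mask & PAGE_PARTS:
        problems.append("128 needs a page part (1, 4, or 8)")
    if mask & SPIDER_CHECK_CACHE_URL and not mask & URL:
        problems.append("256 needs 1 (URL)")
    return problems


# Check the URL and the page content, ignoring case: 1 + 8 + 64 = 73.
mask = URL | PAGE_CONTENT | CASE_INSENSITIVE
print(f"MustHaveCheck={mask}")
print(validate_must_have_check(mask))                    # no problems expected
print(validate_must_have_check(SPIDER_CHECK_CACHE_URL))  # 256 without 1
```

Because each value is a distinct power of two, adding the values and combining them with a bitwise OR give the same result, provided that no value is included twice.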

Example

MustHaveCheck=77

In this example, the connector checks the URLs, headers, and content of pages for case-insensitive matches of any MustHaveCSVs strings. If the URL, header, or content of a page does not contain any of these strings, the connector discards the page.
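
To see why 77 corresponds to these settings, the value can be broken down into its component values (64 + 8 + 4 + 1). The short Python sketch below does this by testing each bit. The FLAGS dictionary and the function name decode_must_have_check are hypothetical helpers for this illustration only.

```python
# Values and labels taken from the Description; the dictionary is illustrative.
FLAGS = {
    1: "URL",
    4: "Page header",
    8: "Page content",
    64: "Case insensitive",
    128: "Before download",
    256: "Spider check cache URL",
    512: "Valid site structure",
    1024: "Spider strip content",
    2048: "Spider check content type",
}


def decode_must_have_check(mask: int) -> list[str]:
    """List the labels of the values that are set in a MustHaveCheck number."""
    return [label for bit, label in FLAGS.items() if mask & bit]


print(decode_must_have_check(77))
# ['URL', 'Page header', 'Page content', 'Case insensitive']
```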

See also

CantHaveCheck

CantHaveCSVs

MustHaveCSVs