Creating Virtual Documents

A virtual document is a document that your gateway creates from data in the repository and delivers to the Verity engine. A virtual document need not have any physical relationship to data in the repository; it typically has a logical relationship, which may or may not be the same logical relationship as one defined in the repository. A virtual document simply defines a “view” of the repository data to access. From the Verity engine’s perspective, a virtual document is a stream of tokens provided by the gateway.

There are several models you can use to represent different mappings between repository content and tokens:

The content in a repository is at the same level, with a one-to-one mapping between an item in the repository and a single virtual document; for example, an employee record in a personnel system. These kinds of documents are monolithic in structure, meaning that there is little or no differentiation between the kinds of content in the document; the document is autonomous.

The content in a repository is organized into files, and several files become a single virtual document; for example, a set of word processing and spreadsheet files used to create a financial report.

The content in a repository has a parent-child relationship, and the parent content and all child content become a single virtual document; for example, e-mail with attachments.

The content in the repository is linked to other content, and all the linked content becomes a single virtual document; for example, a non-relational “network” database.

Each of these models results in a single virtual document. The latter two examples are called compound documents, because the Verity engine can access each part of the content as a separate document in a collection.

These models are not mutually exclusive. For example, an employee record in a personnel system could reference the current payroll record on another system. In that case, you might treat each kind of record as a separate virtual document, you might establish a parent-child relationship between the items in a single virtual document, or, because of the one-to-one mapping between the employee record and the payroll record, you might establish a linked relationship within a single virtual document.

The following sections identify the most common kinds of tokens that you need to create a virtual document and describe how the tokens can be used to implement the models.

Common Token Types

The C header file vdk_strm.h defines the kinds of tokens and their structures. You typically will use a much smaller subset of the defined tokens. The following table identifies the most common tokens that you will use in your gateway:

| Token Type | Description |
| --- | --- |
| VdkTokenType_Buffer | Buffer containing document content. Use one or more buffer tokens to deliver the document’s contents to the Verity engine. |
| VdkTokenType_ContentType | Content specifier. Use a content-type token to specify the format and character set of the content in the stream. For more information, see Bypassing Auto-Recognition. |
| VdkTokenType_Eof | End of stream. Use this token after a file-by-name token or after all other tokens in a stream have been sent to the Verity engine. |
| VdkTokenType_Field | Field. Use one or more field tokens to send external and internal fields to the Verity engine. External fields are those fields that only exist in the repository. Internal fields are those fields that are defined in the repository’s collection or those fields specified in copy statements. |
| VdkTokenType_FileByName | File, which is referenced by its full pathname. Use a file-by-name token whenever a document can be represented as one or more files. |
| VdkTokenType_NewDoc | New document. Use a new document token when you need to create a new virtual document by providing a key that the gateway uses to extract the new document. |
| VdkTokenType_Zone | Zone. Use pairs of zone tokens to identify zones in the collection. Each zone token has a flag that indicates whether it starts a zone or ends a zone. |


For more information about tokens and their structures, see the VgwStreamGetTokenFnc function in Stream Interface.
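
As a rough illustration of how the common token types relate to one another, the following C sketch models a token as a tagged record. Every name in it (SketchTokenKind, SketchToken, sketch_zone_token, sketch_field_token) is a hypothetical stand-in invented for this example; the real token types and structures are defined in vdk_strm.h and will differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the common token types.
 * These are NOT the definitions from vdk_strm.h. */
typedef enum {
    SKETCH_TOKEN_BUFFER,
    SKETCH_TOKEN_CONTENT_TYPE,
    SKETCH_TOKEN_EOF,
    SKETCH_TOKEN_FIELD,
    SKETCH_TOKEN_FILE_BY_NAME,
    SKETCH_TOKEN_NEW_DOC,
    SKETCH_TOKEN_ZONE
} SketchTokenKind;

/* One token in a stream: a kind tag plus the data that kind carries. */
typedef struct {
    SketchTokenKind kind;
    const char *name;   /* field or zone name; NULL for other kinds */
    const char *data;   /* buffer text, field value, pathname, or key */
    int starts_zone;    /* zone tokens only: 1 starts a zone, 0 ends it */
} SketchToken;

/* Builds a zone token; starts_zone is nonzero to open the zone, 0 to close it. */
SketchToken sketch_zone_token(const char *zone_name, int starts_zone)
{
    SketchToken tok;
    assert(zone_name != NULL);
    tok.kind = SKETCH_TOKEN_ZONE;
    tok.name = zone_name;
    tok.data = NULL;
    tok.starts_zone = starts_zone ? 1 : 0;
    return tok;
}

/* Builds a field token carrying a name/value pair. */
SketchToken sketch_field_token(const char *field_name, const char *value)
{
    SketchToken tok;
    assert(field_name != NULL && value != NULL);
    tok.kind = SKETCH_TOKEN_FIELD;
    tok.name = field_name;
    tok.data = value;
    tok.starts_zone = 0;
    return tok;
}
```

A single record type with a kind tag keeps the stream uniform: a driver can hand back one token at a time without the caller needing to know in advance which kind comes next.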

Monolithic Documents

Your gateway can create a monolithic single virtual document, which is a document consisting of any number of tokens. These tokens may be of any type; for example, your gateway driver’s VgwStreamGetTokenFnc function could build a stream that intersperses file-by-name tokens, buffer tokens, field tokens, or other tokens. In one scenario, your gateway could send tokens as follows:

content-type token for the format and field token for the character set; see Bypassing Auto-Recognition

zone token with the start-zone flag set

buffer tokens

zone token with the end-zone flag set

additional zone-buffer-zone sequences of tokens

field tokens

EOF token
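
The sequence above pairs every start-zone token with an end-zone token and finishes with an EOF token. The following sketch shows one way you might sanity-check a planned sequence before implementing it. The one-letter codes and the zones_balanced function are hypothetical notation for this example only; the sketch checks only that zone starts and ends balance, and makes no claim about whether the Verity engine permits nested zones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-letter codes used by this sketch only:
 *   C content-type, F field, Z zone start, z zone end,
 *   B buffer, E EOF. */

/* Returns 1 if every zone start has a matching zone end and the
 * sequence finishes with an EOF token; returns 0 otherwise. */
int zones_balanced(const char *seq)
{
    int depth = 0;
    size_t i;

    assert(seq != NULL);
    if (seq[0] == '\0')
        return 0;               /* an empty stream has no EOF token */

    for (i = 0; seq[i] != '\0'; i++) {
        if (seq[i] == 'Z')
            depth++;
        else if (seq[i] == 'z' && --depth < 0)
            return 0;           /* end-zone with no open zone */
    }
    return depth == 0 && seq[i - 1] == 'E';
}
```

For example, the scenario above with two zones could be written as "CFZBBzZBzFFE", which this check accepts; dropping an end-zone token or the final EOF token makes it fail.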

 

Multiple File Documents

Your gateway can create a single virtual document from a set of files; for example, all files that make up pieces in a logical document, all files in an archive directory, and so on. The document consists of multiple file-by-name tokens and can contain other tokens as well. In one scenario, your gateway could send tokens as follows:

content-type token for the format and field token for the character set; see Bypassing Auto-Recognition

zone token with the start-zone flag set

file-by-name token

zone token with the end-zone flag set

EOF token

additional content type-zone-file-zone-EOF sequences of tokens

You must send an EOF token after every file-by-name token. Because the file-by-name token resets the content format and the character set, you must send a content-type token followed by a field token for the character set before each file-by-name token to bypass auto recognition of these characteristics. If you intersperse file-by-name tokens with other tokens, you must be careful to send, in the appropriate order, EOF, content-type tokens, and field tokens for the character set.
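
The ordering rules in the preceding paragraph can be expressed as a small check. This is a minimal sketch, assuming that you want to bypass auto-recognition for every file and that the character-set field token directly follows the content-type token. The one-letter codes and the file_tokens_well_ordered function are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-letter codes used by this sketch only:
 *   C content-type, F field, Z zone start, z zone end,
 *   N file-by-name, B buffer, E EOF. */

/* Returns 1 if every file-by-name token is preceded by a content-type
 * token plus a field token and is followed by an EOF token before the
 * next file-by-name token or the end of the sequence; 0 otherwise. */
int file_tokens_well_ordered(const char *seq)
{
    int have_type = 0;  /* content-type + field seen since the last file */
    int file_open = 0;  /* file-by-name sent, EOF not yet sent */
    size_t i;

    assert(seq != NULL);
    for (i = 0; seq[i] != '\0'; i++) {
        switch (seq[i]) {
        case 'C':
            /* Assumption: the field token directly follows the content type. */
            have_type = (seq[i + 1] == 'F');
            break;
        case 'N':
            if (!have_type || file_open)
                return 0;
            file_open = 1;
            have_type = 0;  /* file-by-name resets format and character set */
            break;
        case 'E':
            file_open = 0;
            break;
        default:
            break;
        }
    }
    return !file_open;
}
```

With these codes, two files sent back to back look like "CFZNzECFZNzE". Omitting the EOF token between the files, or omitting the content-type and field tokens before the second file, causes the check to fail.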

Compound Documents

Your gateway can create compound documents, which are virtual documents that are related to each other in some way, such as the parent of one or more child documents or documents that link to other documents.

Your gateway driver’s VgwStreamGetTokenFnc function informs the Verity engine that the function is about to stream a related document by sending a new document (VdkTokenType_NewDoc) token, which causes the Verity engine to create a new stream and to start retrieving tokens from the new stream.

You can set the VdkTokenNewDoc_Child behavior flag in the VdkTokenType_NewDoc token to specify a “child” relationship from the new document to the current document, which becomes the “parent.” The Verity engine maintains this relationship and deletes child documents when the parent document is deleted.

If the flag is not set, the Verity engine does not maintain an explicit relationship between documents. A viewing application can respond to a VdkTokenType_NewDoc token by associating the new document with the preceding document and, thus, represent the documents as linked when they are viewed.

You must be very careful to avoid creating a circular reference when creating compound documents with a linked relationship. The existence of a circular reference could set several streams working on the same sets of tokens, with unpredictable results. Further information is beyond the scope of this book; you should consult a computer science textbook on directed acyclic graphs (DAGs) for techniques to avoid circular references.
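
One simple guard, sketched below, is to remember the keys of the documents whose streams are currently open and to refuse to send a new document token for a key that is already in that chain. This is an illustrative technique, not something the Verity engine provides; OpenDocChain, would_create_cycle, open_doc, close_doc, and the MAX_OPEN_DOCS limit are all hypothetical names chosen for this example.

```c
#include <assert.h>
#include <string.h>

#define MAX_OPEN_DOCS 32   /* arbitrary limit chosen for this sketch */

/* Keys of the documents whose streams are currently open, outermost first. */
typedef struct {
    const char *keys[MAX_OPEN_DOCS];
    int depth;
} OpenDocChain;

/* Returns 1 if linking to key would revisit a document that is already
 * being streamed (a circular reference), 0 otherwise. */
int would_create_cycle(const OpenDocChain *chain, const char *key)
{
    int i;

    assert(chain != NULL && key != NULL);
    for (i = 0; i < chain->depth; i++) {
        if (strcmp(chain->keys[i], key) == 0)
            return 1;
    }
    return 0;
}

/* Records key as open. Returns 1 on success, or 0 if the key would close
 * a cycle or the chain is full; in that case the caller should not send
 * a new document token for it. */
int open_doc(OpenDocChain *chain, const char *key)
{
    if (would_create_cycle(chain, key) || chain->depth >= MAX_OPEN_DOCS)
        return 0;
    chain->keys[chain->depth++] = key;
    return 1;
}

/* Call when the stream for the innermost open document reaches EOF. */
void close_doc(OpenDocChain *chain)
{
    assert(chain != NULL && chain->depth > 0);
    chain->depth--;
}
```

Because only the chain of open documents is checked, a document that is linked from two different places is still allowed; only a link that leads back to a document still being streamed is rejected.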

In one scenario, representing a parent document with an attachment, your gateway could send tokens as follows:

content-type tokens for the format and character set; see Bypassing Auto-Recognition

zone token with the start-zone flag set

buffer tokens

zone token with the end-zone flag set

additional zone-buffer-zone sequences of tokens

new document token with the child flag set

field tokens

EOF token

The new document token would cause the Verity engine to call your gateway driver to create a new stream and to obtain tokens for the new document. In this scenario, your gateway could send tokens for the new document as follows:

content-type token for the format and field token for the character set

zone token with the start-zone flag set

file-by-name token

zone token with the end-zone flag set

EOF token

content-type token for the format and field token for the character set

field tokens

EOF token

In the above scenario, the new document stream (file-by-name token) is linked to the first stream (of buffer tokens). For information on the use of content-type and EOF tokens, see Multiple File Documents.

The kind of relationship (parent-child or link) between documents is important for the following reasons:

it controls the inheritance of fields between the two documents

it affects whether fields may be overridden

Field Inheritance

If the compound document has a link relationship (the VdkTokenNewDoc_Child behavior flag is not set), no fields are inherited. If the compound document has a parent-child relationship (the VdkTokenNewDoc_Child behavior flag is set), all fields from the parent document are inherited by the child document except for the following fields:

MIME-Type

Charset

Ext

Size

Your gateway driver’s VgwStreamGetTokenFnc function can send field tokens for fields that have not been inherited when streaming the new document. If you do not want to inherit fields, you must not set the VdkTokenNewDoc_Child behavior flag.
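
The inheritance rule can be summarized as a small predicate. The sketch below is illustrative only: child_inherits_field is a hypothetical helper, the child_flag_set parameter simply mirrors whether the VdkTokenNewDoc_Child behavior flag is set, and the exact-case comparison of field names is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if a child document inherits the named parent field,
 * 0 otherwise. child_flag_set mirrors whether the VdkTokenNewDoc_Child
 * behavior flag is set in the new document token.
 * Assumption: field names are compared with exact case. */
int child_inherits_field(const char *field_name, int child_flag_set)
{
    static const char *const not_inherited[] = {
        "MIME-Type", "Charset", "Ext", "Size"
    };
    size_t i;

    assert(field_name != NULL);
    if (!child_flag_set)
        return 0;   /* link relationship: no fields are inherited */

    for (i = 0; i < sizeof not_inherited / sizeof not_inherited[0]; i++) {
        if (strcmp(field_name, not_inherited[i]) == 0)
            return 0;   /* one of the four excluded fields */
    }
    return 1;
}
```

A driver could use a predicate like this to decide which field tokens it still needs to send when it streams the new document.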

Field Overrides

When the Verity engine writes a field to persistent store, it checks whether the “no field override” flag is set in the token. The “no field override” flag is one of the flags that can be set in the tokFlags member of the VdkTokenRec structure. For more information on this structure, see the VgwStreamGetTokenFnc function described in Stream Interface. The following decision table shows the action taken by the Verity engine:

 


 

|  | Field value already set in persistent store | Field value not yet set in persistent store |
| --- | --- | --- |
| “No field override” flag set in token | No action. Field’s value is not replaced in persistent store | Field’s value is set in persistent store |
| “No field override” flag not set in token | Field’s value is replaced in persistent store | Field’s value is set in persistent store |


The “no field override” flag affects only fields whose values have already been written to persistent store; if the flag is set, the value is not replaced. This flag indicates that the value of the field in the token should not be considered better than an existing value. Although the “no field override” flag can be set by a gateway, the flag is most useful when used by a filter, because the filter may have to “guess” a value for a field. Verity-supplied filters set the “no field override” flag to defer to the gateway.

For example, a filter may attempt to create a title for a document based on the first few words in the first sentence of the document. If the document title has been set by the gateway, the filter can use this flag to defer to the title as specified by the gateway.
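
The decision table reduces to a two-input function. The following sketch restates it in C; should_write_field is a hypothetical name, and representing a field that is not yet set as a NULL pointer is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the incoming value should be written to persistent store,
 * 0 if the stored value should be left untouched.
 *   stored_value      current value in persistent store, or NULL if the
 *                     field has not been set yet (sketch convention)
 *   no_override_flag  nonzero if the token carries "no field override" */
int should_write_field(const char *stored_value, int no_override_flag)
{
    if (stored_value == NULL)
        return 1;               /* not yet set: the value is always written */
    return no_override_flag ? 0 : 1;
}
```

In the title example, a gateway that has already stored a title makes stored_value non-NULL, so a filter token sent with the flag set yields 0 and the gateway’s title is kept.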

An issue arises for parent-child compound documents because fields of a parent document are inherited by a child document. Because Verity-supplied filters set the “no field override” flag, these filters do not replace values of inherited fields. You should consider this issue because, for example, you may not want both the parent and child documents to have the same values for the Author or Title fields and not be able to change them.

Your gateway should not send field tokens for fields in the parent document that you want to override in a child document. In the case of the Author and Title fields, Verity templates support AltAuthor and AltTitle fields. These fields can be used in place of the Author and Title fields. If your gateway sets values for the AltAuthor and AltTitle fields for the parent document, the gateway or a Verity-supplied filter can set values for the Author and Title fields in child documents. You can extend this technique for other fields as well.

Compound Document Keys

Your new document (VdkTokenType_NewDoc) token can specify the current document’s key as part of the key for the new document, which helps to identify the documents as compound documents. For more information about keys, see Creating Document Keys.
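
As a sketch of the idea, the function below builds a key for a new document by appending a part identifier to the current document’s key. The '#' separator, the make_child_key name, and the sample key values are assumptions made for this example; they are not a Verity convention, and the actual key format is whatever your gateway defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes "<parent_key>#<part_id>" into out.
 * The '#' separator is an assumption made for this sketch.
 * Returns 1 on success, 0 if out is too small to hold the whole key. */
int make_child_key(char *out, size_t out_size,
                   const char *parent_key, const char *part_id)
{
    int n;

    assert(out != NULL && parent_key != NULL && part_id != NULL);
    n = snprintf(out, out_size, "%s#%s", parent_key, part_id);

    /* snprintf returns the length it wanted to write; a value that does
     * not fit in out_size (including the terminator) means truncation. */
    return n >= 0 && (size_t)n < out_size;
}
```

For instance, a parent key of "msg-1042" and a part identifier of "att-1" would produce "msg-1042#att-1", so the gateway can recover the parent key from the child key when it is asked to extract the new document.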