Extract Metadata

When a file format supports metadata, HPE KeyView can extract and process that information. Metadata includes document information fields such as title, author, creation date, and file size. Depending on the file’s format, metadata is referred to in a number of ways: for example, “summary information,” “OLE summary information,” “file information,” and “document properties.”

The metadata in mail formats (MSG and EML) and mail stores (PST, NSF, and MBX) is extracted differently from the metadata in other formats. For information on extracting metadata from these formats, see Extract Mail Metadata.

NOTE: HPE KeyView can extract metadata from a document only if metadata is defined in the document and the document reader can extract metadata for the file format. The section Supported Formats lists the file formats for which metadata can be extracted. HPE KeyView does not generate metadata automatically from the document contents.

Extract Metadata Using the API

You can extract the metadata at the API level. The API extracts all valid metadata fields that exist in the file.

To extract metadata using the Java API

  1. Set the input source using the setInputSource method.

  2. Call the getSummaryInfo() method of the Export object to retrieve an object of the SummaryInfo class.

  3. Use the methods of the SummaryInfo object to retrieve the metadata information.

    The HtmlTest sample program demonstrates how to extract metadata through the Java API. See HtmlTest.

Example

SummaryInfo[] sinfo = objHtmlExport.getSummaryInfo();
if (sinfo != null)
{
  System.out.println("\nSummary info has been extracted.");
  // fos_sum and summaryOutFile are assumed to be declared earlier in the program.
  fos_sum = new FileOutputStream(summaryOutFile);
  DataOutputStream dos_sum = new DataOutputStream(fos_sum);
  for (int i = 0; i < sinfo.length; i++)
  {
    // Skip elements that have no name.
    if (sinfo[i].getElementName() != null)
    {
      dos_sum.writeBytes("Element name: " + sinfo[i].getElementName() + "\n");
      dos_sum.writeBytes("Element type: " + sinfo[i].getSumInfoType() + "\n");
      // Write a value only when the field is populated in the document.
      if (sinfo[i].getIsValid())
      {
        if (sinfo[i].isDateTimeType())
        {
          // Date/time fields are available as a formatted string.
          dos_sum.writeBytes("Date/time: ");
          dos_sum.writeBytes(sinfo[i].getDateTime());
        }
        else
        {
          // All other fields are written as raw bytes.
          byte[] data = sinfo[i].getData();
          if (data != null)
          {
            dos_sum.writeBytes("Element data: ");
            dos_sum.write(data);
          }
        }
      }
      dos_sum.writeBytes("\n\n");
    }
  }
  dos_sum.close();
  fos_sum.close();
}
sinfo = null;

The SummaryInfo class stores the metadata extraction results. After calling the HtmlExport.getSummaryInfo() method, call the get methods provided by each instance of this class to extract metadata.

The following describes each get method:

getElementName()

This method gets the name of the metadata element.

getSumInfoType()

This method specifies the data type of the metadata element. The possible types are:

  • KV_String
  • KV_Int4
  • KV_DateTime
  • KV_ClipBoard
  • KV_Bool
  • KV_Unicode
  • KV_IEEE8
  • KV_Other

If type is KV_Bool, data contains either TRUE or FALSE.

KV_DateTime and KV_IEEE8 point to an 8-byte value.

getIsValid()

This method specifies whether the data value is present in the document. TRUE specifies that the value is valid. For example, if the “Title” element was not populated in the document, getIsValid would return FALSE.

isDateTimeType()

This method determines whether the metadata element is of date/time type.

getDateTime()

This method gets the date and time in the form of a string, for example “Wed Jun 30 21:49:08 1993” or “135 Minutes”. Call this method only if the metadata element is of date/time type.

getData()

This method gets the content of the element.

If the metadata field is a date and time, the content is a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601.

You can also use the isDateTimeType() method to determine whether a metadata element is of date/time type, and then use the getDateTime() method to obtain the date/time in the form of a string.
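If you work with the raw 64-bit date/time value rather than the string from getDateTime(), it can be converted with standard Java classes. The following is a minimal sketch and is not part of the KeyView API: the class FileTimeSketch and both helper methods are hypothetical names, and the sketch assumes that the 8 bytes are stored in little-endian order, which this section does not state.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper class; not part of the KeyView API.
public class FileTimeSketch {
    // Number of 100-nanosecond intervals between 1601-01-01 and 1970-01-01 (UTC).
    static final long EPOCH_DIFF_100NS = 116_444_736_000_000_000L;

    // Converts a count of 100-ns intervals since January 1, 1601 to an Instant.
    static Instant filetimeToInstant(long filetime) {
        long sinceUnixEpoch = filetime - EPOCH_DIFF_100NS;
        long seconds = Math.floorDiv(sinceUnixEpoch, 10_000_000L);
        long nanos = Math.floorMod(sinceUnixEpoch, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // ASSUMPTION: the first 8 bytes hold the value in little-endian order.
    static long readLittleEndianLong(byte[] data) {
        return ByteBuffer.wrap(data, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        // The Unix epoch expressed as 100-ns intervals since 1601.
        System.out.println(filetimeToInstant(EPOCH_DIFF_100NS));
    }
}
```

With a SummaryInfo element, the value passed to readLittleEndianLong would be the byte array returned by getData(). Check the byte order against a document with a known date before relying on the result.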

Extract Metadata Using a Template File

When using a template file, HPE KeyView recognizes two types of metadata: standard and non-standard. Standard metadata includes fields such as Title, Author, and Subject. The standard fields are enumerated from 1 to 41 in KVSumType in the install\htmlexport\include\kvtypes.h header file. Non-standard metadata includes any field not listed from 1 to 41 in KVSumType, such as user-defined fields (for example, custom property fields in Microsoft Word documents), or fields that are unique to a particular file type (for example, “Artist” or “Genre” fields in MP3 files). Enumerated types 42 and greater are reserved for non-standard metadata.

To extract metadata using a template file

  1. Insert metadata tokens in a member of the KVHTMLTemplateEx section of the template file. This defines the point at which the metadata appears in the HTML output.

  2. If you are using the $USERSUMMARY or $SUMMARY token, define the szUserSummary member of the KVHTMLTemplateEx section of the template file. This determines the markup and tokens generated when these metadata tokens are processed.

You can use the following metadata tokens in the template files:

Token

Description

$SUMMARYNN

This token inserts the data from a specified metadata field. NN is a number from 00 through 42 enumerated in KVSumType in kvtypes.h.

$SUMMARY

This token inserts the data from valid metadata fields in the range of 0 to 42 using the markup provided in szUserSummary.

$USERSUMMARY

This token inserts the data from every valid non-standard metadata field using the markup provided in szUserSummary.

$CONTENT

This token inserts the content of the metadata field specified by the $NAME token.

$NAME

This token inserts the name of the metadata field, such as “Title,” “Author,” or “Subject.”

Depending on the markup in szUserSummary, the extracted metadata might not appear in the browser when the HTML file is displayed, but might appear in the output file. Most of the HPE KeyView-supplied template files extract standard metadata from a document, and include it in the output HTML. However, they do not display the metadata in a browser.

Examples

$SUMMARYNN

The following markup displays the contents of the “Title” field at the top of the main HTML file:

szMainTop=<em><strong>$SUMMARY01</strong></em>

In KVSumType, 01 is the enumerated value for the “Title” metadata field.

$SUMMARY

The following markup extracts all standard fields, and includes them in the first heading level 1 HTML block:

szFirstH1Start=$SUMMARY
szUserSummary=<meta name="$NAME" content="$CONTENT" />

This example extracts the field name ($NAME) and field content ($CONTENT) for standard metadata from a document, and includes it at the beginning of the first heading level 1 HTML block. However, it does not display the metadata in the browser. The HTML output might look like this:

<meta name="CodePage" content="1252" />
<meta name="Title" content="My design document" />
<meta name="Subject" content="design specifications" />
<meta name="Author" content="John Doe" />
<meta name="Keywords" content="" />
<meta name="Comments" content="" />
<meta name="Template" content="Normal.dot" />
<meta name="LastAuthor" content="lchapman" />
<meta name="RevNumber" content="6" />
<meta name="EditTime" content="01/01/1601, 0:08" />
<meta name="LastPrinted" content="14/01/2002, 14:06" />
<meta name="Create_DTM" content="27/08/2003, 10:31" />
<meta name="LastSave_DTM" content="29/08/2003, 14:07" />
<meta name="PageCount" content="1" />
<meta name="WordCount" content="4062" />
<meta name="CharCount" content="23159" />
<meta name="AppName" content="Microsoft Word 9.0" />
<meta name="Security" content="0" />
<meta name="Category" content="software" />
<meta name="LineCount" content="192" />
<meta name="ParCount" content="46" />
<meta name="Manager" content="" />
<meta name="Company" content="Autonomy" />

To display the metadata in a browser, use the following markup in szUserSummary:

<hr>name="$NAME" content="$CONTENT"<br>

$USERSUMMARY

The following markup extracts non-standard fields, and includes them at the bottom of the main HTML file:

szMainBottom=$USERSUMMARY
szUserSummary=<meta name="$NAME" content="$CONTENT" />

This example extracts the field name ($NAME) and field content ($CONTENT) for non-standard metadata from a document, and includes it at the bottom of the main HTML file. However, it does not display the metadata in the browser. The HTML output might look like this:

<meta name="Telephone number" content="444-111-2222" />
<meta name="Recorded date" content="07/03/2003, 23:00" />
<meta name="Source" content="TRUE" />
<meta name="my property" content="reserved" />

To display the metadata in a browser, use the following markup in szUserSummary:

<hr>name="$NAME" content="$CONTENT"<br>
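
The way the szUserSummary markup is repeated for each metadata field in the examples above can be pictured with a small, self-contained Java sketch. This only illustrates the substitution of $NAME and $CONTENT; it is not how HPE KeyView implements template processing, and the class TokenExpansionSketch and its expand method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of $NAME/$CONTENT substitution; not KeyView code.
public class TokenExpansionSketch {
    private static final Pattern TOKEN = Pattern.compile("\\$(NAME|CONTENT)");

    // Replaces each $NAME and $CONTENT token in the markup in a single pass.
    static String expand(String markup, String name, String content) {
        Matcher m = TOKEN.matcher(markup);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(1).equals("NAME") ? name : content;
            // quoteReplacement keeps '$' and '\' in field values literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String szUserSummary = "<meta name=\"$NAME\" content=\"$CONTENT\" />";
        // Sample field values taken from the example output in this section.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Title", "My design document");
        fields.put("Author", "John Doe");
        // One line of markup is produced per metadata field.
        for (Map.Entry<String, String> field : fields.entrySet()) {
            System.out.println(expand(szUserSummary, field.getKey(), field.getValue()));
        }
    }
}
```

Changing the markup string to the <hr>...<br> form shown above produces visible text instead of meta elements, which is the difference between metadata that is only present in the output file and metadata that is displayed in the browser.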
