Convert XML Files

Export enables you to extract all content or selected content from source XML files. It detects the following XML formats:

See File Format Detection for more information on format detection.

Configure Element Extraction for XML Documents

When converting XML files, you can specify which elements and attributes are extracted according to the file’s format ID or root element. This is useful when you want to extract only relevant text elements, such as abstracts from reports, or a list of authors from an anthology.

A root element is an element in which all other elements are contained. In the XML sample below, book is the root element:

<book>
  <title>XML Introduction</title>
  <product id="33-657" status="draft">XML Tutorial</product>
  <chapter>Introduction to XML
     <para>What is HTML</para>
     <para>What is XML</para>
  </chapter>
  <chapter>XML Syntax
     <para>Elements must have a closing tag</para>
     <para>Elements must be properly nested</para>
  </chapter>
</book>

For example, you could specify that when converting files with the root element book, the element title is extracted as metadata, and only product elements with a status attribute value of draft are extracted.

When you extract an element, the child elements within the element are also extracted. For example, if you extract the element chapter from the sample above, the child element para is also extracted.
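To illustrate the parent/child behavior only, the following sketch uses the JDK's built-in DOM parser rather than the Export API. The class ChildTextSketch and its methods are hypothetical names, and the sample is a shortened copy of the book document above. It reads the root element name and then collects the text of each chapter element, which brings the text of the para children along with it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; this is plain JDK DOM code, not the Export API.
final class ChildTextSketch {
    static final String SAMPLE =
        "<book>"
        + "<title>XML Introduction</title>"
        + "<chapter>Introduction to XML"
        + "<para>What is HTML</para><para>What is XML</para></chapter>"
        + "<chapter>XML Syntax"
        + "<para>Elements must have a closing tag</para></chapter>"
        + "</book>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Name of the element that contains all other elements.
    static String rootName(String xml) {
        return parse(xml).getDocumentElement().getNodeName();
    }

    // Text of every element with the given name, including the text of its children.
    static List<String> textOf(String xml, String elementName) {
        NodeList nodes = parse(xml).getElementsByTagName(elementName);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rootName(SAMPLE));
        System.out.println(textOf(SAMPLE, "chapter"));
    }
}
```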

Export defines default element extraction settings for the following XML formats:

These settings are defined internally and are used when converting these file formats; however, you can modify their values.

In addition to the default extraction settings, you can add custom settings for your own XML document types. If you do not define custom settings for your own XML document types, the settings for generic XML documents are used.

Modify Element Extraction Settings

You can modify configuration settings for XML documents through either the API or the kvxconfig.ini file.

NOTE: You can use customized element extraction settings only when converting files in process. When converting out of process, the default extraction settings are used.

Use the Java API

You can use the Java API to modify the settings for the standard XML document types, or to add configuration settings for your own XML document types.

To modify settings

  1. Declare an array of XMLConfigSet objects.

  2. Create an instance of the ConfigOption class with the following arguments:

    1. Set the OptionType to CFG_SETXMLCONFIGINFO.

    2. Set the OptionValue to 0.

    3. Set OptionData to the array object.

  3. Call the setConfigOption method, and pass the ConfigOption object.

  4. Call a convert method. For example:

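    // Allocate and populate XMLInfo with your extraction settings before use.
    XMLConfigSet[] XMLInfo;
    ConfigOption config = new ConfigOption(Export.CFG_SETXMLCONFIGINFO, 0, XMLInfo);
    objExport.setConfigOption(config);
    // Then call a convert method on objExport.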

Use an Initialization File

You can use the initialization file to modify the settings for the standard XML document types, or to add configuration settings for your own XML document types.

To modify settings

  1. Modify the kvxconfig.ini file.

  2. Use the template file when processing the XML file.

    The Java sample program (HtmlConvFileToFile) demonstrates how to use a template file during the conversion process. See HtmlConvFileToFile.

Modify Element Extraction Settings in the kvxconfig.ini File

The kvxconfig.ini file contains default element extraction settings for supported XML formats. The file is in the install\OS\bin directory, where install is the path name of the Export installation directory and OS is the name of the operating system.

For example, the following entry defines extraction settings for the Microsoft Visio 2003 XML format:

[config3]
eKVFormat=MS_Visio_XML_Fmt
szRoot=
szInMetaElement=DocumentProperties
szExMetaElement=PreviewPicture
szInContentElement=Text
szExContentElement=
szInAttribute=

The following options are available. Each option name is followed by its description.

eKVFormat

The format ID as detected by the KeyView detection module. This option determines the file type to which these extraction settings apply. See File Format Detection for more information on format ID values.

If you are adding configuration settings for a custom XML document type, this option is not defined.

szRoot

The file’s root element. When the format ID is not defined, the root element is used to determine the file type to which these settings apply.

To further qualify the element, specify its namespace. See Specify an Element’s Namespace and Attribute.

szInMetaElement

The elements extracted from the file as metadata. All other elements are extracted as text.

Separate multiple entries with commas. To further qualify the element, specify its namespace, its attributes, or both. See Specify an Element’s Namespace and Attribute.

szExMetaElement

The child elements in the included metadata elements that are not extracted from the file as metadata. For example, the default extraction settings for the Visio XML format extract the DocumentProperties element as metadata. This element includes child elements such as Title, Subject, Author, Description, and so on. However, the child element PreviewPicture is defined in szExMetaElement because it is binary data and should not be extracted.

You cannot exclude any metadata elements from the output for StarOffice files. All metadata is extracted regardless of this setting.

Separate multiple entries with commas. To further qualify the element, specify its namespace, its attributes, or both. See Specify an Element’s Namespace and Attribute.

szInContentElement

The elements extracted from the file as content text. Enter an asterisk (*) to extract all elements including child elements.

Separate multiple entries with commas. To further qualify the element, specify its namespace, its attributes, or both. See Specify an Element’s Namespace and Attribute.

szExContentElement

The child elements in the included content elements that are not extracted from the file as content text.

Separate multiple entries with commas. To further qualify the element, specify its namespace, its attributes, or both. See Specify an Element’s Namespace and Attribute.

szInAttribute

The attribute values extracted from the file. If attributes are not defined here, attribute values are not extracted.

Enter the namespace (if used), element name, and attribute name in the following format:

namespace:elementname@attributename

For example:

hpe:division@name

Separate multiple entries with commas.
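The entry syntax above can be sketched as a small parser. This is only an illustration of the format and not how Export itself reads the file. The class AttributeSpecSketch is a hypothetical name, and the sketch assumes that the namespace prefix is optional and that entries contain no embedded commas.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only illustrates the szInAttribute entry syntax.
final class AttributeSpecSketch {
    final String namespace;  // empty when the entry has no prefix
    final String element;
    final String attribute;

    AttributeSpecSketch(String namespace, String element, String attribute) {
        this.namespace = namespace;
        this.element = element;
        this.attribute = attribute;
    }

    // Parses one entry such as "hpe:division@name".
    static AttributeSpecSketch parse(String entry) {
        String s = entry.trim();
        int at = s.indexOf('@');
        if (at <= 0 || at == s.length() - 1) {
            throw new IllegalArgumentException("Expected element@attribute: " + entry);
        }
        String qualified = s.substring(0, at);
        String attribute = s.substring(at + 1);
        int colon = qualified.indexOf(':');
        String ns = colon < 0 ? "" : qualified.substring(0, colon);
        String element = qualified.substring(colon + 1);
        return new AttributeSpecSketch(ns, element, attribute);
    }

    // Parses a comma-separated option value, skipping empty entries.
    static List<AttributeSpecSketch> parseAll(String value) {
        List<AttributeSpecSketch> out = new ArrayList<>();
        for (String entry : value.split(",")) {
            if (!entry.trim().isEmpty()) {
                out.add(parse(entry));
            }
        }
        return out;
    }
}
```

For example, AttributeSpecSketch.parse("hpe:division@name") yields the namespace hpe, the element division, and the attribute name.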

Specify an Element’s Namespace and Attribute

To further qualify an element, you can specify that the element exist in a certain namespace, that it contain a specific attribute, or both. To define the namespace and attribute of an element, enter the following:

ns_prefix:elemname@attribname=attribvalue

Attribute values that contain spaces must be enclosed in quotation marks.

For example, the following entry:

bg:language@id=xml

extracts a language element in the bg namespace that has an id attribute with the value "xml". This entry extracts the following element from an XML file:

<bg:language id="xml">XML is a simple, flexible text format derived from SGML</bg:language>

but does not extract:

<bg:language id="sgml">SGML is a system for defining markup languages.</bg:language>

or

<adv:language id="xml">The namespace should be a Uniform Resource Identifier (URI).</adv:language>
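The matching rule in this example can be sketched with the JDK's DOM parser. This is a minimal illustration under stated assumptions, not the Export implementation: the class ElementQualifierSketch is a hypothetical name, and the sketch compares namespace prefixes literally rather than resolving them to namespace URIs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical sketch of the ns_prefix:elemname@attribname=attribvalue syntax.
final class ElementQualifierSketch {
    final String name;       // "prefix:element" or just "element"
    final String attribute;  // null when no attribute is required
    final String value;      // null when any attribute value is accepted

    ElementQualifierSketch(String name, String attribute, String value) {
        this.name = name;
        this.attribute = attribute;
        this.value = value;
    }

    static ElementQualifierSketch parse(String entry) {
        String s = entry.trim();
        int at = s.indexOf('@');
        if (at < 0) {
            return new ElementQualifierSketch(s, null, null);
        }
        String name = s.substring(0, at);
        String rest = s.substring(at + 1);
        int eq = rest.indexOf('=');
        if (eq < 0) {
            return new ElementQualifierSketch(name, rest, null);
        }
        String value = rest.substring(eq + 1);
        // Values that contain spaces are enclosed in quotation marks; strip them.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new ElementQualifierSketch(name, rest.substring(0, eq), value);
    }

    // True when the element has the same qualified name and the required attribute.
    boolean matches(Element e) {
        if (!e.getNodeName().equals(name)) return false;
        if (attribute == null) return true;
        if (!e.hasAttribute(attribute)) return false;
        return value == null || value.equals(e.getAttribute(attribute));
    }

    // Parses a one-element XML snippet. The default factory is not
    // namespace-aware, so prefixes do not need to be declared here.
    static Element element(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }
}
```

With the qualifier bg:language@id=xml, matches returns true for the first element shown above and false for the other two, because one has a different id value and the other a different prefix.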

Add Configuration Settings for Custom XML Document Types

You can define element extraction settings for custom XML document types by adding the settings to the kvxconfig.ini file. For example, for files containing the root element hpexml, you could add the following section to the end of the initialization file:

[config101]
eKVFormat=
szRoot=hpexml
szInMetaElement=dc:title,dc:meta@title,dc:meta@name=title
szExMetaElement=
szInContentElement=hpe:division@name=dev,hpe:division@name=export,p@style="Heading 1"
szExContentElement=
szInAttribute=hpe:division@name

The custom extraction settings must be preceded by a section heading named [configN], where N is an integer starting at 100 and increasing by 1 for each additional file type, as in [config100], [config101], [config102], and so on. The default extraction settings for the supported XML formats are numbered config0 to config99. Currently only 0 to 6 are used.
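The numbering rule can be checked with a small helper before you edit the file. The class SectionNameSketch and its methods are hypothetical and are not part of the Export API. The sketch only encodes the ranges stated above: 0 to 99 for default settings, and 100 and higher for custom settings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that classifies kvxconfig.ini section headings.
final class SectionNameSketch {
    private static final Pattern HEADING = Pattern.compile("\\[config(\\d{1,6})\\]");

    // Returns N for a heading of the form [configN], or -1 for anything else.
    static int index(String heading) {
        Matcher m = HEADING.matcher(heading.trim());
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Custom document types use section numbers of 100 and higher.
    static boolean isCustom(String heading) {
        return index(heading) >= 100;
    }

    // Numbers 0 to 99 are reserved for the default extraction settings.
    static boolean isReservedDefault(String heading) {
        int n = index(heading);
        return n >= 0 && n <= 99;
    }
}
```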

Because a custom XML document type is not recognized by the KeyView detection module, the format ID is not defined. The file type is identified by the file’s root element only.

If a custom XML document type is not defined in the kvxconfig.ini file or by the setConfigOption method, the default extraction settings for a generic XML document are used.
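The selection behavior described in this section can be summarized in a short sketch. The lookup order shown here (format ID first, then root element, then the generic settings) is an assumption inferred from the option descriptions, not a documented algorithm. The class ConfigLookupSketch and its Section type are hypothetical names.

```java
import java.util.List;

// Hypothetical sketch of how a settings section might be chosen for a file.
final class ConfigLookupSketch {
    static final class Section {
        final String name;    // for example "config101"
        final String format;  // eKVFormat value, empty for custom types
        final String root;    // szRoot value, may be empty

        Section(String name, String format, String root) {
            this.name = name;
            this.format = format;
            this.root = root;
        }
    }

    static Section select(List<Section> sections, String detectedFormat,
                          String rootElement, Section generic) {
        // 1. A section whose format ID matches the detected format.
        for (Section s : sections) {
            if (!s.format.isEmpty() && s.format.equals(detectedFormat)) {
                return s;
            }
        }
        // 2. A section without a format ID whose root element matches.
        for (Section s : sections) {
            if (s.format.isEmpty() && !s.root.isEmpty() && s.root.equals(rootElement)) {
                return s;
            }
        }
        // 3. Otherwise fall back to the generic XML settings.
        return generic;
    }
}
```

Under these assumptions, a file with the root element hpexml and no recognized format ID would select the [config101] section shown above, and a file with an unlisted root element would fall back to the generic settings.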

