Convert PDF Files

Export has special configuration options that allow greater control over the conversion of PDF files. These options can improve the fidelity and accuracy of the HTML output.

Use a Graphic-based Reader

Two graphic-based PDF readers are available. The readers display PDFs by converting each page of the PDF to an image. If you do not want to redistribute the Acrobat Reader with your application, you can use a graphic-based reader instead.

The two readers support different features. Choose the appropriate reader depending on your requirements:

Use the kppdfrdr Reader

The kppdfrdr graphic-based reader has the following features:

The kppdfrdr reader has the following limitations:

Use the kppdf2rdr Reader

The kppdf2rdr graphic-based reader produces high-fidelity raster images. However, it has the following limitations:

Specify the Graphic-based Reader

By default, the Acrobat control is used to convert PDF documents. Use the following procedure to specify that one of the graphic-based readers be used to convert PDF documents.

To specify the graphic-based reader

  1. Open the formats_e.ini file with a text editor. The file is installed in the root of the Windows directory.

  2. In the [HiFi] section, set the following parameter to the graphic-based reader you want to use. Set one of the following values:

  3. Set the CFG_SETHIFIPDF field in the HtmlExport class.

Convert PDF Files to Raster Images

Export allows you to convert each page of a PDF document to a raster image, providing a high-fidelity conversion of the document.

The output format depends on the value of setOutputRasterGraphicType in HtmlOptionInfo.

On UNIX and Linux, the conversion of PDFs to JPEG uses the Java program kvraster.class. This Java program requires some setup. See Display Vector Graphics on UNIX and Linux.

To specify the graphic-based reader for converting PDF documents

  1. Specify the graphic-based reader you want to use.

  2. Create an instance of the ConfigOption class. Set the OptionType argument to CFG_SETHIFIPDF, and the OptionValue argument to 1.

  3. Call the setConfigOption method and pass the ConfigOption object.

  4. Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.

The HtmlConvFileToFile sample program demonstrates how to use the setConfigOption() method. See HtmlConvFileToFile.

Convert PDF Files to a Logical Reading Order

The PDF format is primarily designed for presentation and printing of brochures, magazines, forms, reports, and other materials with complex visual designs. Most PDF files do not contain the logical structure of the original document—the correct reading order, for example, and the presence and meaning of significant elements such as headers, footers, columns, tables, and so on.

KeyView can convert a PDF file either by using the file's internal unstructured paragraph flow, or by applying a structure to the paragraphs to reproduce the logical reading order of the visual page. Logical reading order enables HPE KeyView to output PDF files that contain languages that read from right to left (such as Hebrew and Arabic) in the correct reading direction.

NOTE: The algorithm used to reproduce the reading order of a PDF page is based on common page layouts. The paragraph flow generated for PDFs with unique or complex page designs might not emulate the original reading order exactly.

For example, page design elements such as drop capitals, callouts that cross column boundaries, and significant changes in font size might disrupt the logical flow of the output text.

Logical Reading Order and Paragraph Direction

By default, HPE KeyView produces an unstructured text stream for PDF files. This means that PDF paragraphs are extracted in the order in which they are stored in the file, not the order in which they appear on the visual page. For example, a three-column article could be output with the headers and the title at the end of the output file, and the second column extracted before the first column. Although this output does not represent a logical reading order, it accurately reflects the internal structure of the PDF.

You can configure HPE KeyView to produce a structured text stream that flows in a specified direction. This means that PDF paragraphs are extracted in the order (logical reading order) and direction (left-to-right or right-to-left) in which they appear on the page.

The following paragraph direction options are available.

  • Left-to-right: Paragraphs flow logically and read from left to right. Specify this option when most of your documents are in a language that uses a left-to-right reading order, such as English or German.
  • Right-to-left: Paragraphs flow logically and read from right to left. Specify this option when most of your documents are in a language that uses a right-to-left reading order, such as Hebrew or Arabic.
  • Dynamic: Paragraphs flow logically. The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. This option is used when a paragraph direction is not specified.

NOTE: Conversions might be slower when logical reading order is enabled. For optimal speed, use an unstructured paragraph flow.

The paragraph direction options control the direction of paragraphs on a page; they do not control the text direction in a paragraph. For example, a PDF file might contain English paragraphs in three columns that read from left to right, but 80% of the second paragraph contains Hebrew characters. If you enable the left-to-right logical reading order, the paragraphs are ordered logically in the output—title paragraph, then paragraph 1, 2, 3, and so on—and flow from the top left of the first column to the bottom right of the third column. However, the text direction of the second paragraph is determined independently of the page by the PDF reader, and is output from right to left.
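
The difference between the stored order and a logical reading order can be illustrated with a toy model. The following Java sketch is not the algorithm that KeyView uses. The class ReadingOrderSketch, the Paragraph type, and the assumption that each paragraph already has a known column and vertical position are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model only: this is not KeyView's reading-order algorithm. */
final class ReadingOrderSketch {

    /** Hypothetical paragraph with a column index and a vertical position. */
    static final class Paragraph {
        final String text;
        final int column; // 0 is the leftmost column
        final int top;    // distance from the top of the page

        Paragraph(String text, int column, int top) {
            this.text = text;
            this.column = column;
            this.top = top;
        }
    }

    /** Unstructured flow: paragraphs are returned in the order in which they are stored. */
    static List<String> rawOrder(List<Paragraph> stored) {
        return texts(stored);
    }

    /** Structured flow: column by column, top to bottom within each column. */
    static List<String> logicalOrder(List<Paragraph> stored, boolean rightToLeft) {
        Comparator<Paragraph> byColumn = Comparator.comparingInt(p -> p.column);
        if (rightToLeft) {
            byColumn = byColumn.reversed(); // start from the rightmost column
        }
        List<Paragraph> sorted = new ArrayList<>(stored);
        sorted.sort(byColumn.thenComparingInt(p -> p.top));
        return texts(sorted);
    }

    private static List<String> texts(List<Paragraph> paragraphs) {
        List<String> out = new ArrayList<>();
        for (Paragraph p : paragraphs) {
            out.add(p.text);
        }
        return out;
    }
}
```

In this model, the paragraph direction changes only the order in which columns are visited. The text inside each paragraph is untouched, which mirrors the distinction between paragraph direction and text direction described above.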

NOTE: Extraction of metadata is not affected by the paragraph direction setting. The characters and words in metadata fields are extracted in the correct reading direction regardless of whether logical reading order is enabled.

Enable Logical Reading Order

You can enable logical reading order by using either the API or the formats_e.ini file. Setting the direction in the API overrides the setting in the formats_e.ini file.

Use the Java API

To enable PDF logical reading order in the Java API

  1. Use the setPDFLogicalOrder(int orderFlag) method of the HtmlExport object, and set the orderFlag argument to one of the following flags.

    • PDF_LOGICAL_ORDER_LTR: Logical reading order and left-to-right paragraph direction.
    • PDF_LOGICAL_ORDER_RTL: Logical reading order and right-to-left paragraph direction.
    • PDF_LOGICAL_ORDER_AUTO: Logical reading order. The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. This option is used when a paragraph direction is not specified.
    • PDF_LOGICAL_ORDER_RAW: Unstructured paragraph flow. This is the default behavior. Set this flag if logical reading order is enabled, and you want to return to an unstructured paragraph flow.

For example:

objHTMLExport.setPDFLogicalOrder(Export.PDF_LOGICAL_ORDER_RTL);

Use the formats_e.ini File

The formats_e.ini file is in the install\OS\bin directory, where install is the path name of the Export installation directory and OS is the name of the operating system.

To enable logical reading order by using the formats_e.ini file

  1. Change the PDF reader entry in the [Formats] section of the formats_e.ini file as follows:

    [Formats]
    200=lpdf

  2. Optionally, add the following section to the end of the formats_e.ini file:

    [pdf_flags]
    pdf_direction=paragraph_direction

    where paragraph_direction is one of the following.

    • LPDF_LTR: Left-to-right paragraph direction.
    • LPDF_RTL: Right-to-left paragraph direction.
    • LPDF_AUTO: The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. This option is used when a paragraph direction is not specified.
    • LPDF_RAW: Unstructured paragraph flow. This is the default behavior. Set this flag if logical reading order is enabled, and you want to return to an unstructured paragraph flow.
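
If you generate or check configuration files from a deployment script, the two settings in this procedure can be assembled programmatically. The following Java sketch is one possible helper, not part of the Export API. The class name PdfDirectionIni and the render method are hypothetical. The section names, the 200=lpdf entry, and the flag names are taken from the procedure above.

```java
import java.util.Set;

/** Hypothetical helper: renders the formats_e.ini lines described in this procedure. */
final class PdfDirectionIni {

    // Flag names as listed for the pdf_direction parameter.
    private static final Set<String> FLAGS =
            Set.of("LPDF_LTR", "LPDF_RTL", "LPDF_AUTO", "LPDF_RAW");

    /**
     * Returns the [Formats] entry and the [pdf_flags] section as text.
     * Rejects any value that is not one of the documented flags.
     */
    static String render(String directionFlag) {
        if (!FLAGS.contains(directionFlag)) {
            throw new IllegalArgumentException("Unknown pdf_direction flag: " + directionFlag);
        }
        return String.join("\n",
                "[Formats]",
                "200=lpdf",
                "",
                "[pdf_flags]",
                "pdf_direction=" + directionFlag) + "\n";
    }
}
```

The output shows the lines as they should appear in the file. In an existing formats_e.ini file, the 200 entry is edited in place in the [Formats] section rather than appended.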

Generate a Table of Contents from PDF Bookmarks

When you use the basic reader (pdfsr) to convert PDF files to HTML, the table of contents is generated from "bookmarks" within the PDF file. The hyperlinked table of contents can appear either at the beginning of the HTML file or in a separate frame.

HPE recommends that you configure the conversion so that the table of contents appears in a separate frame (the pdfframe.ini template demonstrates how to do this). Export uses absolute positioning when converting a PDF file; that is, the text appears in the same position as in the original document. Table of contents entries do not contain absolute positioning information. Therefore, if the main document and the table of contents are generated in the same output file, the table of contents entries might overlap the body text in the document.

NOTE: When PDF bookmarks are converted to a table of contents in HTML, the generated links do not lead to the exact location of the destination marker, but jump to the page on which the destination marker exists. This is similar to the behavior of the Adobe Acrobat Reader.

Disable Bookmark Conversion

By default, Export converts PDF bookmarks to a table of contents in the HTML output. However, you can configure Export not to generate a table of contents based on the PDF bookmarks.

To specify that PDF bookmarks are not converted and included in the HTML output

  1. Create an instance of the ConfigOption class. Set the OptionType argument to CFG_SUPPRESSTOCPRINTIMAGE, and the OptionValue argument to 1.

  2. Call the setConfigOption method and pass the ConfigOption object.

  3. Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.

    NOTE: A table of contents is not generated when a PDF file does not contain bookmarks, or when CFG_SUPPRESSTOCPRINTIMAGE is set.

Convert Invisible Text

PDF documents sometimes contain invisible text, which you can search in Adobe PDF Reader but cannot view in a web browser.

Toggle Invisible Text

You can add a JavaScript button to the upper right corner of the exported page, which you can click to toggle between invisible and regular text. When you turn on invisible text, the invisible text is displayed and the regular content is hidden; when you turn off invisible text, the invisible text is hidden.

Invisible text is hidden by default. The toggle button appears only if invisible text is detected in the PDF document.

To add an invisible text toggle button

Specify Opacity of Invisible Text

Invisible text often occurs in PDF documents when the PDF software processes rasterized images through optical character recognition and then inserts the text in the PDF. You might want to display both the invisible text and the rasterized image. To do so, you can set the invisible text opacity to an integer from 0 to 100, where 0 hides the invisible text and 100 displays it fully.

Invisible text opacity is set to 0 by default.

To set invisible text opacity

Convert Rotated Text

By default, rotated text is displayed in its original position, at the original font size, and at 0 degrees rotation in the HTML output. The text is not rotated in the HTML output because text rotation is not supported by HTML.

Because the text is the original size, but might be displayed in a smaller space (at 0 degrees), the text might overlap adjacent text in the HTML output. To avoid this problem, you can specify that the rotated text be removed from its original position and displayed at the bottom of the HTML page on which it appears.

This option applies only to PDF files. Support for other formats is planned for future releases.

To specify that rotated text be displayed at the bottom of the HTML page

  1. Create an instance of the ConfigOption class. Set the OptionType argument to CFG_SETTEXTROTATE, and the OptionValue argument to 1.
  2. Call the setConfigOption method and pass the ConfigOption object.
  3. Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.

    NOTE: When this feature is enabled, white space is added to the bottom of every HTML page to accommodate any rotated text.

Control Hyphenation

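There are two types of hyphens in a PDF document: soft hyphens, which are added to a word that is divided across two lines of text, and hard hyphens, which are used in all other cases, such as in compound words.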

By default, HPE KeyView maintains the source document’s soft hyphens in the output HTML to more accurately represent the layout of the source document. However, if you are using Export to generate text output for an indexing engine, or if you are not concerned with maintaining the layout of the document, HPE recommends that you remove soft hyphens from the HTML output. To remove soft hyphens, you must enable the soft hyphen flag.

NOTE: If the soft hyphen flag is enabled, every hyphen at the end of a line is considered a soft hyphen and removed from the HTML output. Hard hyphens at the end of a line are also removed. This might result in an intentionally hyphenated word being extracted without a hyphen.

To remove soft hyphens from the HTML output

  1. Create an instance of the ConfigOption class. Set the OptionType argument to CFG_DELSOFTHYPHEN and the OptionValue argument to 1.
  2. Call the setConfigOption method and pass the ConfigOption object.
  3. Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.
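
The side effect described in the note in this section can be illustrated with a small text-only model. The following Java sketch is not how Export removes hyphens internally. The class SoftHyphenSketch and its joinLines method are hypothetical, and the sketch assumes that lines are separated by newline characters.

```java
/** Toy model only: treats every hyphen at the end of a line as a soft hyphen. */
final class SoftHyphenSketch {

    /**
     * Joins the lines of a text block into one line.
     * A hyphen at the end of a line is dropped and the word is rejoined;
     * all other line breaks become a single space.
     */
    static String joinLines(String text) {
        String[] lines = text.split("\n");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            boolean last = (i == lines.length - 1);
            if (!last && line.endsWith("-")) {
                // Drop the trailing hyphen and do not add a space.
                out.append(line, 0, line.length() - 1);
            } else {
                out.append(line);
                if (!last) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }
}
```

In this model, a word that is split as "docu-" and "ment" is rejoined as "document", which is the intended result. A compound word that happens to break as "long-" and "term" becomes "longterm", because a hard hyphen at the end of a line cannot be told apart from a soft one. A hyphen in the middle of a line, as in "long-term", is left unchanged.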

Improve Performance for PDFs with Many Small Images

To improve performance when converting PDF files that contain many small images, you can specify in the formats_e.ini file the minimum pixel height and width for images that are converted to JPEG. If an image is smaller than the minimum height and width, HPE KeyView does not generate a JPEG file for the image.

For example, to specify that images that are 16 pixels or less in height and width are not converted, add the following to the [pdf_flags] section of the formats_e.ini file:

[pdf_flags]
process_images_with_min_height=17
process_images_with_min_width=17
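
Note that the configured value is one more than the largest size that you want to skip. The following Java sketch makes that arithmetic explicit. The class MinImageSizeIni and its render method are hypothetical helpers, not part of the Export API. The section and parameter names are taken from the example above.

```java
/** Hypothetical helper: renders the minimum image size settings for formats_e.ini. */
final class MinImageSizeIni {

    /**
     * Returns the [pdf_flags] lines that skip images whose height and width
     * are maxSkippedPixels or less.
     */
    static String render(int maxSkippedPixels) {
        if (maxSkippedPixels < 0) {
            throw new IllegalArgumentException("maxSkippedPixels must not be negative");
        }
        // The settings hold the smallest size that is still converted.
        int min = maxSkippedPixels + 1;
        return "[pdf_flags]\n"
                + "process_images_with_min_height=" + min + "\n"
                + "process_images_with_min_width=" + min + "\n";
    }
}
```

Calling render(16) reproduces the settings shown in the example above.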

Extract Custom Metadata from PDF Files

To extract custom metadata from your PDF files, add the custom metadata names to the pdfsr.ini file provided, and copy the modified file to the \bin directory. You can then extract metadata as you normally would.

The pdfsr.ini is in the samples\pdfini directory, and has the following structure:

<META>
<TOTAL>total_item_number</TOTAL>,
/metadata_tag_name datatype,
</META>

The structure uses the following parameters:

  • total_item_number: The total number of metadata tags that are listed.
  • metadata_tag_name: The metadata tag name used in the PDF files.
  • datatype: The data type of the metadata field. The possible types are:

      • KV_String
      • KV_Int4
      • KV_DateTime
      • KV_ClipBoard
      • KV_Bool
      • KV_Unicode
      • KV_IEEE8
      • KV_Other

For example:

<META>
<TOTAL> 4 </TOTAL>
/part_number     INT4
/volume          INT4
/purchase_date   DATETIME
/customer        STRING
</META>

NOTE: Metadata cannot be extracted from PDFs when the PDF is converted to JPEG. See Convert PDF Files to Raster Images.
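
Because the TOTAL value must match the number of tags listed, it can be useful to check a modified pdfsr.ini file before you copy it to the \bin directory. The following Java sketch shows one way to do this. It is a pre-flight check written against the structure shown in this section, not a description of how the PDF reader parses the file. The class PdfsrIniSketch and its parse method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical pre-flight check for a <META> block in pdfsr.ini. */
final class PdfsrIniSketch {

    /**
     * Parses the tag lines of a <META> block into a name-to-datatype map
     * and verifies that <TOTAL> matches the number of tag lines.
     */
    static Map<String, String> parse(String block) {
        Map<String, String> tags = new LinkedHashMap<>();
        int declaredTotal = -1;
        for (String raw : block.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("<TOTAL>")) {
                int end = line.indexOf("</TOTAL>");
                if (end < 0) {
                    throw new IllegalArgumentException("Missing </TOTAL>: " + line);
                }
                declaredTotal = Integer.parseInt(line.substring("<TOTAL>".length(), end).trim());
            } else if (line.startsWith("/")) {
                // A tag line has the form: /metadata_tag_name datatype
                String[] parts = line.substring(1).split("\\s+");
                if (parts.length != 2) {
                    throw new IllegalArgumentException("Expected name and datatype: " + line);
                }
                tags.put(parts[0], parts[1]);
            }
        }
        if (declaredTotal != tags.size()) {
            throw new IllegalStateException(
                    "TOTAL is " + declaredTotal + " but " + tags.size() + " tags are listed");
        }
        return tags;
    }
}
```

Applied to the example block in this section, the check accepts the file and returns four entries. If a tag is added without updating TOTAL, the check fails with an IllegalStateException.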
