The PDF input handler

The PDF format is widely used for scientific and technical documentation and for document archival. It has extensive facilities for storing metadata along with the document, and these facilities are actually used in practice.

As a consequence, the rclpdf.py PDF input handler is more capable and more configurable than most other handlers. Specifically, rclpdf.py has the following features:

  • It can be configured to extract specific metadata tags from an XMP packet.

  • It can extract PDF attachments.

  • It can automatically perform OCR if the document text is empty. This is done by executing an external program. OCR is described in a separate section, because the OCR framework can also be used with non-PDF image files.
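
To make the first feature more concrete, here is a minimal sketch of what extracting selected tags from an XMP packet can look like. It is not the rclpdf.py implementation: the function name extract_xmp_tags, the sample packet, and the mapping from tag names to field names are all assumptions chosen for illustration. It relies only on the Python standard library and on the fact that an XMP packet is an RDF/XML document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample packet. In a real PDF the packet is embedded in
# the file and has to be extracted first.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/">
   <pdf:Keywords>indexing search</pdf:Keywords>
   <xmp:CreatorTool>ExampleWriter</xmp:CreatorTool>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

# Namespace prefixes used to write tag names in the "wanted" mapping.
NAMESPACES = {
    "pdf": "http://ns.adobe.com/pdf/1.3/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}


def extract_xmp_tags(packet, wanted):
    """Return {field_name: text} for each wanted 'prefix:Tag' found.

    'wanted' maps an XMP tag name such as 'pdf:Keywords' to the
    (illustrative) field name under which its value should be stored.
    Tags absent from the packet are silently skipped.
    """
    root = ET.fromstring(packet)
    found = {}
    for qname, field in wanted.items():
        prefix, local = qname.split(":", 1)
        # ElementTree addresses namespaced elements as {uri}localname.
        elem = root.find(".//{%s}%s" % (NAMESPACES[prefix], local))
        if elem is not None and elem.text:
            found[field] = elem.text.strip()
    return found


result = extract_xmp_tags(
    SAMPLE_XMP,
    {
        "pdf:Keywords": "keywords",
        "xmp:CreatorTool": "creatortool",
        "pdf:Producer": "producer",  # not present in the sample packet
    },
)
print(result)
```

The sketch only handles tags whose value is plain element text. XMP properties stored as RDF containers (for example a list of authors) would need additional handling.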