The PDF format is very important for scientific and technical documentation, and for document archival. It has extensive facilities for storing metadata along with the document, and these facilities are actually used in the real world.
As a consequence, the rclpdf.py PDF input handler has more complex capabilities than most other handlers. It is also more configurable: some of the extra features require executing external commands, so they are not enabled by default. Specifically, rclpdf.py has the following optional features:
It can extract PDF outlines and bookmarks.
It can be configured to extract specific metadata tags from an XMP packet.
It can extract PDF attachments.
It can automatically perform OCR if the document text is empty. This is done by executing an external program and is now described in a separate section, because the OCR framework can also be used with non-PDF image files.
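To make the XMP feature in the list above more concrete, the following is a minimal Python sketch of how selected tags could be pulled out of an XMP packet once it has been obtained as text. It is not the code used by rclpdf.py. The function name extract_xmp_tags, the sample packet, and the bibtex namespace URI are all assumptions made up for the illustration; only the rdf and dc namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample XMP packet, for illustration only.
# The "bibtex" namespace URI below is a placeholder, not a real registered one.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:bibtex="http://example.org/bibtex/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">A sample title</rdf:li></rdf:Alt></dc:title>
      <bibtex:year>2014</bibtex:year>
      <bibtex:month>June</bibtex:month>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

# Prefix-to-URI map used to resolve the requested tag names (assumed values).
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "bibtex": "http://example.org/bibtex/",
}


def extract_xmp_tags(xmp_text, wanted, namespaces=NAMESPACES):
    """Return {qualified tag name: [values]} for the requested tags.

    Tags that are absent from the packet are simply left out of the result.
    """
    root = ET.fromstring(xmp_text)
    result = {}
    for qname in wanted:
        prefix, local = qname.split(":", 1)
        uri = namespaces[prefix]
        values = []
        # ElementTree names namespaced elements as "{uri}local".
        for elem in root.iter("{%s}%s" % (uri, local)):
            # itertext() also collects text nested in rdf:Alt / rdf:li containers.
            text = " ".join(t.strip() for t in elem.itertext() if t.strip())
            if text:
                values.append(text)
        if values:
            result[qname] = values
    return result


if __name__ == "__main__":
    print(extract_xmp_tags(SAMPLE_XMP, ["bibtex:year", "dc:title"]))
```

Selecting tags by qualified name keeps the set of extracted fields explicit, which matches the idea of configuring which specific metadata tags should be extracted rather than taking everything found in the packet.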