The PDF input handler

XMP fields extraction
PDF attachment indexing

The PDF format is widely used for scientific and technical documentation and for document archival. It has extensive facilities for storing metadata along with the document, and these facilities are actually used in practice.

As a result, the rclpdf.py PDF input handler has more complex capabilities than most other handlers, and it is also more configurable. Specifically, rclpdf.py has the following features:

It can be configured to extract specific metadata tags from an XMP packet.
It can extract PDF attachments.
It can automatically perform OCR if the document text is empty. This is done by executing an external program, and is described in a separate section because the OCR framework can also be used with non-PDF image files.
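
The OCR fallback in the last item can be pictured with a short Python sketch. This is only an illustration of the general idea (use the extracted text if there is any, otherwise run an external program), not the actual rclpdf.py code. The function name text_or_ocr, its parameters, and the choice to treat whitespace-only text as empty are all assumptions made for the example.

```python
import subprocess
import sys


def text_or_ocr(extracted_text, pdf_path, ocr_command):
    """Return extracted_text, or the output of an external program if it is empty.

    Hypothetical helper: ocr_command is an argument list for some external
    OCR program, which is assumed to take the file path as its last argument
    and to write the recognized text to standard output.
    """
    # Assumption: text containing only whitespace counts as "empty".
    if extracted_text.strip():
        return extracted_text
    # Run the external program and capture what it prints.
    result = subprocess.run(
        ocr_command + [pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a real OCR program: a Python one-liner that prints fixed text.
    fake_ocr = [sys.executable, "-c", "print('text from OCR')"]
    print(text_or_ocr("", "scan.pdf", fake_ocr))
```

Running the OCR step only when the normal text extraction yields nothing avoids paying the cost of OCR for the majority of documents, which already carry a text layer.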
