OCR for PDF documents

Note that if modifying the files (or a copy of them) is acceptable, using OCRmyPDF to add a text layer to the PDF itself is a better solution than the Recoll OCR feature. Among other advantages, it allows Recoll to position the PDF viewer on the search target when opening the document, and it permits secondary searches in the native tool.

Recoll OCR is enabled by the pdfocr configuration variable, and is only run if the processed file has no text content.

Example configuration fragment in recoll.conf:

pdfocr = 1
ocrprogs = tesseract
tesseractlang = eng

The pdfocr variable can be set globally or for specific subtrees.
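
As a sketch of the per-subtree case, the fragment below leaves PDF OCR off globally and turns it on for a single directory. The path /home/me/scans is a hypothetical example, and the fragment assumes the usual recoll.conf convention in which a bracketed directory name starts a section that applies to that subtree only:

# Global values: must come before any bracketed section
pdfocr = 0
ocrprogs = tesseract
tesseractlang = eng

# Hypothetical directory holding scanned PDFs
[/home/me/scans]
pdfocr = 1

Restricting OCR in this way keeps indexing time down when only part of the collection is made of scanned documents.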