Running OCR on image documents

The Recoll PDF handler can call an external OCR program when the processed file has no text content. The OCR data is stored in a cache of separate files, so the originals are never modified.

Note that, if modifying the files (or a copy of them) is acceptable, running something like OCRmyPDF to add a text layer to the PDF itself is a better solution. Among other benefits, this allows Recoll to position the PDF viewer on the search target when opening the document, and permits a secondary search in the native tool.
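As an illustration of the OCRmyPDF approach, here is a minimal Python sketch that builds and runs an ocrmypdf command line. The helper names (ocrmypdf_command, add_text_layer) are hypothetical and not part of Recoll. The sketch assumes that ocrmypdf is installed and on the PATH, and it always writes to a separate output file so that the original is left untouched.

```python
import shutil
import subprocess
from typing import List


def ocrmypdf_command(src: str, dst: str, lang: str = "eng") -> List[str]:
    """Build the argument list for an OCRmyPDF run (hypothetical helper)."""
    # -l selects the OCR language; --skip-text leaves pages that already
    # carry a text layer untouched instead of failing on them.
    return ["ocrmypdf", "-l", lang, "--skip-text", src, dst]


def add_text_layer(src: str, dst: str, lang: str = "eng") -> bool:
    """Run OCRmyPDF if it is installed; return False when it is missing."""
    if shutil.which("ocrmypdf") is None:
        return False
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(ocrmypdf_command(src, dst, lang), check=True)
    return True
```

Calling add_text_layer("scan.pdf", "scan-ocr.pdf") would then produce a searchable copy that Recoll can index like any other PDF with text content.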

To enable the Recoll OCR feature, you need to install one of the supported OCR applications (tesseract or ABBYY), enable OCR in the PDF handler (by setting pdfocr to 1 in the index configuration file), and tell Recoll how to run the OCR by setting configuration variables. All of these parameters can be set for specific subdirectories through the usual mechanism of the main configuration file (path sections).

Example configuration fragment in recoll.conf:

pdfocr = 1
ocrprogs = tesseract
pdfocrlang = eng

This facility received a major update in Recoll 1.26.5. Older versions had a more limited, non-caching capability to execute an external OCR program in the PDF handler. The new implementation has the following features:

  • The OCR output is cached and stored as separate files. The caching is ultimately based on a hash value of the original file contents, so that it is immune to file renames. A first, path-based layer ensures fast operation for unchanged (unmoved) files, and the data hash (which is still orders of magnitude faster than OCR) is only recomputed if the file has moved. OCR is only performed if the file was not previously processed or if it has changed.

  • The support for a specific program is implemented in a simple Python module. It should be straightforward to add support for any OCR engine with a capability to run from the command line.

  • Modules initially exist for tesseract (Linux and Windows) and ABBYY FineReader (Linux, tested with version 11). ABBYY FineReader is a commercial closed-source program, but it sometimes performs better than tesseract.

  • The OCR is currently only called from the PDF handler, but there should be no problem using it for other image types.
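
The two-layer caching idea described in the first item above can be sketched in Python as follows. This is an illustration only, not Recoll's actual code. The OcrCache class, its in-memory dictionaries, the use of SHA-256, and the use of modification time plus size as the "unchanged" test are all assumptions made for the example (the real cache stores its data as separate files).

```python
import hashlib
import os
from typing import Callable, Dict, Tuple


class OcrCache:
    """Toy in-memory two-layer OCR cache (hypothetical, not Recoll's code)."""

    def __init__(self, run_ocr: Callable[[str], str]) -> None:
        self.run_ocr = run_ocr  # function mapping a file path to OCR text
        # Layer 1: path -> (mtime_ns, size, content hash)
        self.by_path: Dict[str, Tuple[int, int, str]] = {}
        # Layer 2: content hash -> OCR text
        self.by_hash: Dict[str, str] = {}
        self.ocr_runs = 0  # counts the expensive OCR calls actually made

    @staticmethod
    def _content_hash(path: str) -> str:
        # Hash the file in 1 MiB chunks; SHA-256 is an assumption here.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def get_text(self, path: str) -> str:
        st = os.stat(path)
        sig = (st.st_mtime_ns, st.st_size)
        entry = self.by_path.get(path)
        if entry is not None and entry[:2] == sig:
            # Fast path: same path, same time and size, reuse the stored hash.
            digest = entry[2]
        else:
            # New, moved, or modified file: recompute the content hash.
            digest = self._content_hash(path)
            self.by_path[path] = (sig[0], sig[1], digest)
        if digest not in self.by_hash:
            # Contents never seen before: this is the only place OCR runs.
            self.by_hash[digest] = self.run_ocr(path)
            self.ocr_runs += 1
        return self.by_hash[digest]
```

With this structure, renaming a file costs one hash computation but no OCR run, because the second layer is keyed on the contents and not on the path. Only a file whose contents were never seen before reaches run_ocr.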