Running OCR on image documents

The Recoll PDF handler (rclpdf.py) and the alternate image handler (rclimg.py) can call an external OCR program (for the latter, only as of Recoll 1.43.3).

The operation details are slightly different for PDF and other image documents.

To enable the Recoll OCR feature, you need to install one of the supported OCR applications (tesseract or ABBYY), enable OCR in the PDF or image handler by setting the appropriate configuration parameter, and tell Recoll how to run the OCR program by setting the specific OCR configuration variables. All parameters can be localized in subdirectories through the usual main configuration mechanism (path sections).

This facility got a major update in Recoll 1.26.5. Older versions had a more limited, non-caching capability to execute an external OCR program in the PDF handler. The new function has the following features:

  • The OCR output is cached and stored as separate files. The caching is ultimately based on a hash value of the original file contents, so that it is immune to file renames. A first path-based layer ensures fast operation for unchanged (unmoved) files, and the data hash (which is still orders of magnitude faster than OCR) is only recomputed if the file has moved. OCR is only performed if the file was not previously processed or if it has changed.

  • The support for a specific OCR program is implemented in a simple Python module. It should be straightforward to add support for any OCR engine that can be run from the command line.

  • Modules initially exist for tesseract (Linux and Windows) and ABBYY FineReader (Linux, tested with version 11). ABBYY FineReader is a commercial closed-source program, but it sometimes performs better than tesseract.
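
The two-layer caching scheme described in the first item above can be illustrated with a short Python sketch. This is not Recoll's actual implementation: the class name OcrCache, the use of SHA-1, the in-memory path index, the ".txt" cache file naming and the run_ocr callback are all assumptions made for the example.

```python
import hashlib
import os


class OcrCache:
    """Illustrative two-layer OCR cache (hypothetical, not Recoll's code)."""

    def __init__(self, cachedir):
        self.cachedir = cachedir
        os.makedirs(cachedir, exist_ok=True)
        # Path layer: path -> (mtime_ns, size, content hash).
        # Kept in memory here; a real cache would persist it.
        self.pathindex = {}

    @staticmethod
    def _hash_file(path):
        # Content hash: far cheaper than OCR, but still reads the whole file.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    def _content_key(self, path):
        st = os.stat(path)
        entry = self.pathindex.get(path)
        # Fast path: same path, same mtime and size, so reuse the stored hash.
        if entry and entry[0] == st.st_mtime_ns and entry[1] == st.st_size:
            return entry[2]
        # Unknown, moved or modified file: recompute the data hash.
        digest = self._hash_file(path)
        self.pathindex[path] = (st.st_mtime_ns, st.st_size, digest)
        return digest

    def get_text(self, path, run_ocr):
        """Return OCR text for path, calling run_ocr(path) only on a miss."""
        cached = os.path.join(self.cachedir, self._content_key(path) + ".txt")
        if os.path.exists(cached):
            with open(cached, encoding="utf-8") as f:
                return f.read()
        text = run_ocr(path)
        with open(cached, "w", encoding="utf-8") as f:
            f.write(text)
        return text
```

Because the cached text is stored under the content hash rather than the file name, a renamed file misses the path layer but still hits the cache once its hash has been recomputed, so run_ocr is not called again for it.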