Writing a document input handler

Terminology

The small programs or pieces of code which handle the processing of the different document types for Recoll used to be called filters, which is still reflected in the name of the directory which holds them and many configuration variables. They were named this way because one of their primary functions is to filter out the formatting directives and keep the text content. However these modules may have other behaviours, and the term input handler is now progressively substituted in the documentation. filter is still used in many places though.

Recoll input handlers cooperate to translate from the multitude of input document formats, simple ones as opendocument, acrobat, or compound ones such as Zip or Email, into the final Recoll indexing input format, which is plain text (in many cases the processing pipeline has an intermediary HTML step, which may be used for better previewing presentation). Most input handlers are executable programs or scripts. A few handlers are coded in C++ and live inside recollindex. This latter kind will not be described here.

There are two kinds of external executable input handlers:

  • Simple exec handlers run once and exit. They can be bare programs like antiword, or scripts using other programs. They are very simple to write, because they just need to print the converted document to the standard output. Their output can be plain text or HTML. HTML is usually preferred because it can store metadata fields and it allows preserving some of the formatting for the GUI preview. However, these handlers have limitations:

    • They can only process one document per file.

    • The output MIME type must be known and fixed.

    • The character encoding, if relevant, must be known and fixed (or possibly just depending on location).

  • Multiple execm handlers can process multiple files (sparing the process startup time which can be very significant), or multiple documents per file (e.g.: for archives or multi-chapter publications). They communicate with the indexer through a simple protocol, but are nevertheless a bit more complicated than the older kind. Most of the new handlers are written in Python (exception: rclimg which is written in Perl because exiftool has no real Python equivalent). The Python handlers use common modules to factor out the boilerplate, which can make them very simple in favorable cases. The subdocuments output by these handlers can be directly indexable (text or HTML), or they can be other simple or compound documents that will need to be processed by another handler.

In both cases, handlers deal with regular file system files, and can process either a single document, or a linear list of documents in each file. Recoll is responsible for performing up to date checks, deal with more complex embedding and other upper level issues.

A simple handler returning a document in text/plain format, can transfer no metadata to the indexer. Generic metadata, like document size or modification date, will be gathered and stored by the indexer.

Handlers that produce text/html format can return an arbitrary amount of metadata inside HTML meta tags. These will be processed according to the directives found in the fields configuration file.

The handlers that can handle multiple documents per file return a single piece of data to identify each document inside the file. This piece of data, called an ipath will be sent back by Recoll to extract the document at query time, for previewing, or for creating a temporary file to be opened by a viewer. These handlers can also return metadata either as HTML meta tags, or as named data through the communication protocol.

The following section describes the simple handlers, and the next one gives a few explanations about the execm ones. You could conceivably write a simple handler with only the elements in the manual. This will not be the case for the other ones, for which you will have to look at the code.