Terminology
The small programs or pieces of code which handle the
processing of the different document types for Recoll used to be
called filters, which is still reflected in the name of the directory which
holds them and many configuration variables. They were named this way because one of their
primary functions is to filter out the formatting directives and keep the text
content. However, these modules may have other behaviours, and the term input
handler is now progressively substituted in the
documentation. The term filter is still used in many places, though.
Recoll input handlers cooperate to translate from the multitude of input document formats, simple ones such as OpenDocument or Acrobat, or compound ones such as Zip or Email, into the final Recoll indexing input format, which is plain text (in many cases the processing pipeline has an intermediary HTML step, which may be used for better previewing presentation). Most input handlers are executable programs or scripts. A few handlers are coded in C++ and live inside recollindex. This latter kind will not be described here.
There are two kinds of external executable input handlers:
Simple exec handlers run once and exit. They can be bare programs like antiword, or scripts using other programs. They are very simple to write, because they just need to print the converted document to the standard output. Their output can be plain text or HTML. HTML is usually preferred because it can store metadata fields and it allows preserving some of the formatting for the GUI preview. However, these handlers have limitations:
They can only process one document per file.
The output MIME type must be known and fixed.
For handlers producing text/plain, the character encoding must be known and fixed (or possibly just depending on location).
Multiple execm handlers can process multiple files (sparing the process startup time, which can be very significant), or multiple documents per file (e.g. for archives or multi-chapter publications). They communicate with the indexer through a simple protocol, but are nevertheless a bit more complicated than the older kind. Most of the new handlers are written in Python (exception: rclimg, which is written in Perl because exiftool has no real Python equivalent). The Python handlers use common modules to factor out the boilerplate, which can make them very simple in favorable cases. The subdocuments output by these handlers can be directly indexable (text or HTML), or they can be other simple or compound documents that will need to be processed by another handler.
In both cases, handlers deal with regular file system files, and can process either a single document, or a linear list of documents in each file. Recoll is responsible for performing up-to-date checks, dealing with more complex embedding, temporary files, and other upper level issues.
A simple handler returning a document in text/plain format can
transfer no metadata to the indexer. Generic metadata, like document size or modification
date, will be gathered and stored by the indexer.
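A simple text/plain handler of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format is hypothetical (lines starting with a dot are treated as formatting directives), the input is assumed to be Latin-1, and the file to convert is assumed to be passed as the last command-line argument. The script drops the directives and always writes UTF-8, so that the output type and encoding stay fixed.

```python
#!/usr/bin/env python3
"""Sketch of a simple (run once and exit) handler for a hypothetical format."""
import sys

INPUT_ENCODING = "latin-1"   # assumption: the hypothetical format is always Latin-1
OUTPUT_ENCODING = "utf-8"    # the output encoding must be known and fixed


def strip_directives(raw: bytes) -> str:
    """Drop formatting directives (lines starting with '.') and keep the text."""
    text = raw.decode(INPUT_ENCODING)
    kept = [line for line in text.splitlines() if not line.startswith(".")]
    return "\n".join(kept) + "\n"


def main(argv: list) -> None:
    if len(argv) < 2:
        sys.stderr.write("usage: handler.py <file>\n")
        return
    # Assumption: the file to convert is the last command-line argument.
    with open(argv[-1], "rb") as f:
        converted = strip_directives(f.read())
    # Write bytes so the output encoding does not depend on the locale.
    sys.stdout.buffer.write(converted.encode(OUTPUT_ENCODING))


if __name__ == "__main__":
    main(sys.argv)
```

Nothing but the converted text reaches the standard output here, which matches the point above: a text/plain handler has no channel for metadata.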
Handlers that produce text/html format can return an arbitrary amount
of metadata inside HTML meta tags. These will be processed according to the
directives found in the fields
configuration file.
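As a rough sketch of what such HTML output could look like, the Python function below wraps converted text in a document whose head carries the metadata. The function name and the field names in the example (such as author) are illustrative assumptions; which fields are actually indexed or stored depends on the fields configuration file.

```python
import html


def render_html(body_text: str, title: str, fields: dict) -> str:
    """Build an HTML document carrying metadata in <meta> tags.

    `fields` maps hypothetical field names to values; the names used in
    the example below are illustrations, not a list of recognized fields.
    """
    meta = "\n".join(
        '<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in fields.items()
    )
    return (
        "<html><head>\n"
        "<title>{}</title>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "{}\n"
        "</head><body>\n"
        "<pre>{}</pre>\n"
        "</body></html>\n"
    ).format(html.escape(title), meta, html.escape(body_text))


if __name__ == "__main__":
    print(render_html("Hello & welcome", "Demo", {"author": "J. Doe"}))
```

Escaping both the field values and the body keeps characters such as quotes or angle brackets in the source document from breaking the generated markup.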
The handlers that can handle multiple documents per file return a single piece of data
to identify each document inside the file. This piece of data, called an
ipath, will be sent back by Recoll to extract the document at query time, for
previewing, or for creating a temporary file to be opened by a viewer. These handlers can also
return metadata either as HTML meta tags, or as named data through the
communication protocol.
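To make the ipath idea more concrete, here is a sketch using a Zip archive and Python's standard zipfile module. It assumes that the archive member name serves as the ipath, which is only an assumption made for illustration, and it leaves out the communication protocol with the indexer entirely. The function names are hypothetical.

```python
import io
import zipfile


def list_ipaths(archive: bytes) -> list:
    """Indexing time: return one identifier (ipath) per document in the file."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        # Assumption: the member name serves as the ipath; skip directories.
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract_document(archive: bytes, ipath: str) -> bytes:
    """Query time: the ipath is sent back to fetch that one document."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(ipath)


def make_demo_archive() -> bytes:
    """Build a small in-memory archive so the sketch is self-contained."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("notes/a.txt", "first document")
        zf.writestr("b.txt", "second document")
    return buf.getvalue()


if __name__ == "__main__":
    data = make_demo_archive()
    for ipath in list_ipaths(data):
        print(ipath, extract_document(data, ipath))
```

The important property is that the identifier returned at indexing time is enough, on its own, to retrieve the same document later for previewing or for writing a temporary file.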
The following section describes the simple handlers, and the next one gives a few
explanations about the execm ones. You could conceivably write a simple
handler with only the elements in the manual. This will not be the case for the other ones,
for which you will have to look at the code.

