Terminology
The small programs or pieces of code which handle the processing of the different document types for Recoll used to be called filters, which is still reflected in the name of the directory which holds them and in many configuration variables. They were named this way because one of their primary functions is to filter out the formatting directives and keep the text content. However, these modules may have other behaviours, and the term input handler is now progressively replacing it in the documentation. The term filter is still used in many places, though.
Recoll input handlers cooperate to translate from the multitude of input document formats, simple ones such as OpenDocument or Acrobat, or compound ones such as Zip or Email, into the final Recoll indexing input format, which is plain text (in many cases the processing pipeline has an intermediate HTML step, which may be used for better preview presentation). Most input handlers are executable programs or scripts. A few handlers are coded in C++ and live inside recollindex. This latter kind will not be described here.
There are two kinds of external executable input handlers:
Simple exec handlers run once and exit. They can be bare programs like antiword, or scripts using other programs. They are very simple to write, because they just need to print the converted document to the standard output. Their output can be plain text or HTML. HTML is usually preferred because it can store metadata fields and it allows preserving some of the formatting for the GUI preview. However, these handlers have limitations:
They can only process one document per file.
The output MIME type must be known and fixed.
The character encoding, if relevant, must be known and fixed (or possibly just depending on location).
Multiple execm handlers can process multiple files (sparing the process startup time, which can be very significant), or multiple documents per file (e.g. for archives or multi-chapter publications). They communicate with the indexer through a simple protocol, but are nevertheless a bit more complicated than the older kind. Most of the new handlers are written in Python (exception: rclimg, which is written in Perl because exiftool has no real Python equivalent). The Python handlers use common modules to factor out the boilerplate, which can make them very simple in favorable cases. The subdocuments output by these handlers can be directly indexable (text or HTML), or they can be other simple or compound documents that will need to be processed by another handler.
In both cases, handlers deal with regular file system files, and can process either a single document, or a linear list of documents in each file. Recoll is responsible for performing up-to-date checks, dealing with more complex embedding, and handling other upper-level issues.
A simple handler returning a document in text/plain format can transfer no metadata to the indexer. Generic metadata, like document size or modification date, will be gathered and stored by the indexer.
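As an illustration of how little a simple handler needs to do, here is a minimal sketch in Python. It is not taken from the Recoll sources. The input format is hypothetical (a text file in which lines starting with a dot are formatting directives), the function names are invented for the example, and it assumes that the indexer passes the input file name as the command line argument.

```python
import sys


def convert_to_text(raw: bytes) -> str:
    """Toy conversion for a hypothetical format: decode the bytes and
    drop every line that starts with '.', treated here as a formatting
    directive. Only the text content is kept."""
    text = raw.decode("utf-8", errors="replace")
    kept = [line for line in text.splitlines() if not line.startswith(".")]
    return "\n".join(kept) + "\n"


def run(path: str, out=sys.stdout) -> None:
    """Convert one file and print the result to the standard output,
    which is all that a simple handler has to do."""
    with open(path, "rb") as f:
        out.write(convert_to_text(f.read()))
```

A real script would end by calling run(sys.argv[1]) and would exit with a non-zero status on failure. Because the output is plain text, no metadata is transferred.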
Handlers that produce text/html format can return an arbitrary amount of metadata inside HTML meta tags. These will be processed according to the directives found in the fields configuration file.
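The following sketch shows one way a handler could wrap its extracted text and metadata in HTML. The helper name to_html and the field names used in the usage note are assumptions made for the example. Which field names are actually meaningful depends on the fields configuration file.

```python
import html


def to_html(body_text: str, fields: dict) -> str:
    """Build an HTML document whose head carries one meta tag per
    metadata field, and whose body carries the escaped text."""
    metas = "\n".join(
        '<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in fields.items()
    )
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        + metas + "\n"
        "</head><body>\n"
        "<pre>" + html.escape(body_text) + "</pre>\n"
        "</body></html>\n"
    )
```

For example, to_html("Some text", {"author": "Jane Doe"}) yields a document with a meta tag named author in its head. Escaping both the field values and the body keeps the output well-formed even when the source text contains angle brackets or quotes.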
The handlers that can handle multiple documents per file return a single piece of data to identify each document inside the file. This piece of data, called an ipath, will be sent back by Recoll to extract the document at query time, for previewing, or for creating a temporary file to be opened by a viewer. These handlers can also return metadata either as HTML meta tags, or as named data through the communication protocol.
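The idea of an ipath can be illustrated with a Zip archive, using only the Python standard library. This sketch does not implement the execm protocol, which is not described here. Using the archive member name as the ipath is an assumption made for the example, and both function names are hypothetical.

```python
import io
import zipfile


def list_ipaths(archive) -> list:
    """Indexing side: return one identifier per document stored in the
    archive. Here the identifier is simply the member name."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract_by_ipath(archive, ipath: str) -> bytes:
    """Query side: given an identifier returned earlier, fetch the
    corresponding document again, e.g. for a preview."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


# Small in-memory demonstration archive with two documents.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("a.txt", "alpha")
    zf.writestr("sub/b.txt", "beta")
demo_bytes = demo.getvalue()
```

The important property is that the identifier produced at indexing time is enough, together with the file itself, to retrieve exactly the same subdocument later.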
The following section describes the simple handlers, and the next one gives a few explanations about the execm ones. You could conceivably write a simple handler with only the elements in the manual. This will not be the case for the other ones, for which you will have to look at the code.