"Multiple" handlers

If you can program and want to write an execm handler, it should not be too difficult to make sense of one of the existing handlers.

The existing handlers differ in the amount of helper code that they use:

  • rclimg is written in Perl and handles the execm protocol all by itself (showing how trivial it is).

  • All the Python handlers share at least the rclexecm.py module, which handles the communication. Have a look at, for example, rclzip.py for a handler which uses rclexecm.py directly.

  • Most Python handlers which process single-document files by executing another command are further abstracted by using the rclexec1.py module. See for example rclrtf.py for a simple one, or rcldoc.py for a slightly more complicated one (possibly executing several commands).

  • Handlers which extract text from an XML document by using an XSLT style sheet are now executed inside recollindex, with only the style sheet stored in the filters/ directory. These can use a single style sheet (e.g. abiword.xsl), or two sheets for the data and metadata (e.g. opendoc-body.xsl and opendoc-meta.xsl). The mimeconf configuration file defines how the sheets are used; have a look at it. Before they were moved into the C++ code, the XSLT-based handlers used a common module, rclgenxslt.py, which is still around but unused at the moment. The handler for OpenXML presentations is still the Python version because the format did not fit with what the C++ code does. It would be a good starting point for another format with a similar problem.

There is a sample trivial handler based on rclexecm.py, with many comments, which is not actually used by Recoll. It would index a text file as one document per line. Look for rcltxtlines.py in the src/filters directory in the online Recoll Git repository (the sample is not in the distributed release at the moment).

You can also have a look at the slightly more complex rclzip.py which uses Zip file paths as identifiers (ipath).
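To illustrate the idea of using archive member paths as identifiers, here is a minimal sketch built only on the Python standard library zipfile module. It is not the code of rclzip.py and does not use rclexecm.py; the function names (list_ipaths, extract_by_ipath, make_demo_archive) are hypothetical and exist only for this example.

```python
import io
import zipfile


def list_ipaths(archive):
    """Return the member names of a Zip archive, usable as ipath values.

    'archive' is a file name or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        # Skip directory entries: they carry no data to index.
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract_by_ipath(archive, ipath):
    """Return the raw bytes of the member identified by ipath."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


def make_demo_archive():
    """Build a small in-memory archive (hypothetical test data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docs/readme.txt", "hello")
        zf.writestr("notes.txt", "world")
    return buf.getvalue()
```

Because the identifier is the member path itself, fetching a document needs nothing more than the archive and the ipath, and the value stays readable in a log file.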

execm handlers sometimes need to choose the nature of the ipath elements that they use in communication with the indexer. Here are a few guidelines:

  • Use ASCII or UTF-8 (if the identifier is an integer, print it as decimal text, for example as printf %d would do).

  • If at all possible, the data should make some kind of sense when printed to a log file to help with debugging.

  • Recoll uses a colon (:) as a separator to store a complex path internally (for deeper embedding). Colons inside the ipath elements output by a handler will be escaped, but would be a bad choice as a handler-specific separator (mostly, again, for debugging issues).

In any case, the main goal is that it should be easy for the handler to extract the target document, given the file name and the ipath element.
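As an illustration of these guidelines, the following sketch uses a 1-based line number, printed as decimal text, as the ipath for a one-document-per-line text file. It is plain Python and deliberately independent of rclexecm.py; the function names (ipath_for_line, extract_line, demo) are hypothetical and are not taken from rcltxtlines.py.

```python
import os
import tempfile


def ipath_for_line(lineno):
    """Format a 1-based line number as a decimal ASCII string."""
    return "%d" % lineno


def extract_line(filename, ipath):
    """Return the line identified by ipath from the given text file."""
    wanted = int(ipath)
    with open(filename, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno == wanted:
                return line.rstrip("\n")
    raise LookupError("no line %s in %s" % (ipath, filename))


def demo():
    """Write a three-line file, then fetch its second line by ipath."""
    fd, filename = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write("first\nsecond\nthird\n")
        return extract_line(filename, ipath_for_line(2))
    finally:
        os.remove(filename)
```

The identifier is plain ASCII, contains no colon, reads naturally in a log, and the target document can be recovered from the file name and the ipath alone.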

execm handlers will also produce a document with a null ipath element. Depending on the type of document, this may have some associated data (e.g. the body of an email message), or none (typical for an archive file). If it is empty, this document will be useful anyway for some operations, as the parent of the actual data documents.
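The relation between the null-ipath document and its children can be sketched as follows, using the Python standard library email package. This is only an illustration under stated assumptions: it is not how the Recoll email handling is implemented, the function names (documents, demo_message) are hypothetical, and numbering the attachments from 1 is an arbitrary choice made for the example.

```python
from email.message import EmailMessage


def documents(msg):
    """Yield (ipath, data) pairs for a message and its attachments.

    The first pair has an empty ipath and carries the message body;
    each attachment then gets a decimal index as its ipath.
    """
    body = msg.get_body(preferencelist=("plain",))
    body_text = body.get_content() if body is not None else ""
    yield "", body_text
    for index, part in enumerate(msg.iter_attachments(), start=1):
        yield "%d" % index, part.get_content()


def demo_message():
    """Build a message with a text body and one binary attachment."""
    msg = EmailMessage()
    msg["Subject"] = "demo"
    msg.set_content("Hello")
    msg.add_attachment(b"data", maintype="application",
                       subtype="octet-stream", filename="a.bin")
    return msg
```

For an archive, the same structure would apply with an empty string as the data of the first pair, which then serves only as the parent of the member documents.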