If you can program and want to write an execm handler, it should not be too difficult to make sense of one of the existing handlers.
The existing handlers differ in the amount of helper code they use:
rclimg is written in Perl and handles the execm protocol all by itself (showing how trivial it is).
All the Python handlers share at least the rclexecm.py module, which handles the communication. Have a look at, for example, rclzip.py for a handler which uses rclexecm.py directly.
Most Python handlers which process single-document files by executing another command are further abstracted by using the rclexec1.py module. See for example rclrtf.py for a simple one, or rcldoc.py for a slightly more complicated one (possibly executing several commands).
Handlers which extract text from an XML document by using an XSLT style sheet are now executed inside recollindex, with only the style sheet stored in the filters/ directory. These can use a single style sheet (e.g. abiword.xsl), or two sheets for the data and metadata (e.g. opendoc-body.xsl and opendoc-meta.xsl). The mimeconf configuration file defines how the sheets are used; have a look at it. Before the XSLT processing was moved into the C++ code, the XSL-based handlers used a common module, rclgenxslt.py, which is still around but unused at the moment. The handler for OpenXML presentations is still the Python version because the format did not fit with what the C++ code does. It would be a good base for handling another format with a similar issue.
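As a rough illustration of the "execute another command" style of handler, the following sketch runs an external converter on a file and returns its output as text. It does not use rclexec1.py or the execm protocol. The function name extract_text and the convention of passing the command as an argument list are assumptions made for this example only.

```python
import subprocess


def extract_text(command, path):
    """Run `command` with `path` appended and return its stdout as text.

    `command` is a list such as ["some-converter", "--option"]; the
    converter name is hypothetical and must print the document text
    on its standard output.
    """
    result = subprocess.run(
        command + [path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the converter fails
    )
    # Decode leniently: converter output is not always clean UTF-8.
    return result.stdout.decode("utf-8", errors="replace")
```

A handler that needs several commands (one per step of the conversion) could call such a function repeatedly and concatenate or post-process the results.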
There is a sample trivial handler based on rclexecm.py, with many comments, not actually used by Recoll. It would index a text file as one document per line. Look for rcltxtlines.py in the src/filters directory in the online Recoll Git repository (the sample is not in the distributed release at the moment).
You can also have a look at the slightly more complex rclzip.py, which uses Zip file paths as identifiers (ipath).
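The one-document-per-line idea can be sketched in plain Python, independently of rclexecm.py. The class and method names (LineDocs, documents, get) are hypothetical and do not reflect the actual rcltxtlines.py code. The point is only that each subdocument gets an identifier from which the handler can find it again.

```python
class LineDocs:
    """Hypothetical helper: expose each line of a text file as one document."""

    def __init__(self, path):
        # Read the whole file once; a real handler might stream instead.
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            self.lines = f.read().splitlines()

    def documents(self):
        """Yield (ipath, text) pairs, one per line, in file order."""
        for number, text in enumerate(self.lines):
            # The identifier is the zero-based line index as a decimal string.
            yield "%d" % number, text

    def get(self, ipath):
        """Return the text of the line designated by `ipath`."""
        return self.lines[int(ipath)]
```

Enumeration (documents) is what indexing would need, while direct access (get) corresponds to fetching a single result later, for example for a preview.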
execm handlers sometimes need to choose the nature of the ipath elements that they use in communication with the indexer. Here are a few guidelines:
Use ASCII or UTF-8. If the identifier is an integer, print it as, for example, printf %d would do.
If at all possible, the data should make some kind of sense when printed to a log file to help with debugging.
Recoll uses a colon (:) as a separator to store a complex path internally (for deeper embedding). Colons inside the ipath elements output by a handler will be escaped, but would be a bad choice as a handler-specific separator (mostly, again, for debugging issues).
In any case, the main goal is that it should be easy for the handler to extract the target document, given the file name and the ipath element.
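These guidelines can be illustrated with a small sketch for an imaginary handler that addresses documents by a sheet name and a row number. Everything here is an assumption for the sake of the example: the "|" separator, the function names make_ipath and parse_ipath, and the sheet/row addressing scheme itself.

```python
SEP = "|"  # hypothetical handler-specific separator; deliberately not ':'


def make_ipath(sheet, row):
    """Build a readable ipath such as 'Budget 2024|17'."""
    # The integer part is printed in plain decimal, as printf %d would do.
    return "%s%s%d" % (sheet, SEP, row)


def parse_ipath(ipath):
    """Recover (sheet, row) from an ipath built by make_ipath."""
    # Split on the last separator only, so a sheet name containing the
    # separator character is still recovered intact.
    sheet, row = ipath.rsplit(SEP, 1)
    return sheet, int(row)
```

The resulting strings are valid UTF-8 when encoded, make sense when printed to a log file, and are cheap for the handler to decode back into what it needs to locate the target document.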
execm handlers will also produce a document with a null ipath element. Depending on the type of document, this may have some associated data (e.g. the body of an email message), or none (typical for an archive file). If it is empty, this document will be useful anyway for some operations, as the parent of the actual data documents.
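For an archive, the relationship between the parent document and the data documents could look like the following sketch, which uses Zip member paths as ipath values and an empty string for the archive itself. This is not the rclzip.py implementation; the function names list_documents and extract are hypothetical, and only the standard zipfile module is assumed.

```python
import io
import zipfile


def list_documents(zip_source):
    """Return the ipaths for an archive, with the parent document first."""
    with zipfile.ZipFile(zip_source) as archive:
        # Skip directory entries, which carry no data of their own.
        members = [n for n in archive.namelist() if not n.endswith("/")]
    # The empty ipath designates the archive itself, parent of all members.
    return [""] + members


def extract(zip_source, ipath):
    """Return the bytes of the document designated by `ipath`."""
    if ipath == "":
        # Typical for an archive: the parent document has no data.
        return b""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(ipath)
```

Here zip_source can be a file name or a seekable binary file object, as accepted by zipfile.ZipFile.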