Recoll external indexers

The Recoll indexer is capable of processing many different document formats. However, some forms of data storage do not lend themselves easily to standard processing because of their great variability. A canonical example would be data in an SQL database. While it might be possible to create a configurable filter to process data from a database, the many variations in storage organisation and SQL dialects make this difficult.

Recoll can instead support external indexers, where all responsibility for handling the data format is delegated to an external script. The script currently has to be written in Python 3, because it is the only language for which an API binding exists.

Up to Recoll 1.35, such an indexer had to work on a separate Recoll index, which would be added as an external index for querying from the main one, and for which a separate indexing schedule had to be managed. The reason was that the main document indexer purge pass (removal of deleted documents) would also remove all the documents belonging to the external indexer, as they were not seen during the filesystem walk (and conversely, the external indexer purge pass would delete all the regular document entries).

As of Recoll 1.36, an improvement and a new API call allow external indexers to be fully integrated and to work on the main index, with updates triggered from the normal recollindex program.

An external indexer has to do the same work as the Recoll file system indexer: look for modified documents, extract their text, call the API to index them, and call the API to purge the data for deleted documents.
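The shape of that work can be illustrated with a small self-contained sketch. It deliberately does not use the Recoll Python API: ToyIndex, needs_update, add_or_update, purge, signature and run_pass are all hypothetical names invented for this illustration, the index is a plain in-memory dictionary, and the change signature is a content hash chosen only for simplicity. It shows one way an update pass might detect changed documents, index them, and then drop entries that were not seen during the pass.

```python
import hashlib


class ToyIndex:
    """Hypothetical in-memory stand-in for an index (NOT the Recoll API)."""

    def __init__(self):
        self.docs = {}     # document id -> (signature, text)
        self.seen = set()  # ids visited during the current pass

    def needs_update(self, doc_id, sig):
        # Visiting a document marks it as still existing, so purge keeps it.
        self.seen.add(doc_id)
        entry = self.docs.get(doc_id)
        return entry is None or entry[0] != sig

    def add_or_update(self, doc_id, sig, text):
        self.docs[doc_id] = (sig, text)
        self.seen.add(doc_id)

    def purge(self):
        # Remove every entry that was not visited during this pass.
        stale = sorted(set(self.docs) - self.seen)
        for doc_id in stale:
            del self.docs[doc_id]
        self.seen.clear()
        return stale


def signature(text):
    # Assumption: a content hash serves as the change signature here.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_pass(index, source):
    """Index new or changed entries of `source` (id -> text), then purge."""
    updated = []
    for doc_id, text in source.items():
        sig = signature(text)
        if index.needs_update(doc_id, sig):
            index.add_or_update(doc_id, sig, text)
            updated.append(doc_id)
    removed = index.purge()
    return updated, removed


# Two passes over a pretend data store (for example, rows of a database).
index = ToyIndex()
rows = {"row:1": "first note", "row:2": "second note"}
first = run_pass(index, rows)   # both rows are new

rows["row:1"] = "first note, edited"
del rows["row:2"]
second = run_pass(index, rows)  # one row changed, one row deleted
```

In this sketch the purge step removes anything not visited during the pass, which is also why two indexers purging the same index independently would delete each other's documents unless the purge is made aware of which entries belong to which indexer.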

A description of the API method follows, but you can also jump ahead for a look at some sample pseudo-code and a pair of actual implementations, one of which does something useful.