External indexer samples

First, a quick look at the main part of an indexer, in Python 3 pseudocode:

    from recoll import recoll

    # Connect to the Recoll index. This uses the RECOLL_CONFDIR variable, set
    # by the parent recollindex process, to select the right index.
    rcldb = recoll.connect(writable=1)

    # Important: tell the Recoll db that we are going to update documents for the
    # MYBACK backend. All other documents will be marked as present so that they
    # are not affected by the subsequent purge.
    rcldb.preparePurge("MYBACK")

    # Walk your dataset (of course your code will not look like this)
    for mydoc in mydoclist:
        # Compute the doc unique identifier and the signature corresponding to its update state
        # (e.g. mtime and size for a file).
        udi = mydoc.udi()
        sig = mydoc.sig()
        # Check with recoll if the document needs updating. This has the side effect of marking
        # it present.
        if not rcldb.needUpdate(udi, sig):
            continue
        # The document data does not exist in the index or needs updating. Create and add a Recoll
        # Doc object
        doc = recoll.Doc()
        doc.mimetype = "some/type"
        # Say that the document belongs to this indexer
        doc.rclbes = "MYBACK"
        # The url will be passed back to you along with the udi if the fetch
        # method is called later (for previewing), or may be used for opening the document with
        # its native app from Recoll. The udi has a maximum size because it is used as a Xapian
        # term. The url has no such limitation.
        doc.url = "someurl"
        doc.sig = sig
        # Of course add other fields like "text" (duh), "author" etc. See the samples.
        doc.text = mydoc.text()
        # [...]
        # Then add or update the data in the index.
        rcldb.addOrUpdate(udi, doc)

    # Finally call purge to delete the data for documents which were not seen at all.
    rcldb.purge()
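
The mark-and-purge cycle used above can be easier to follow on a toy model. The following sketch is not Recoll code: ToyIndex, run_pass and their method names are hypothetical, and the in-memory dictionary only imitates the behaviour described in the comments above (other backends are protected, unchanged documents are marked present, unseen documents are deleted).

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index: udi -> (backend, sig)."""

    def __init__(self):
        self.docs = {}     # udi -> (backend, sig)
        self.seen = set()  # udis marked present during the current pass

    def prepare_purge(self, backend):
        # Documents from other backends are marked present up front,
        # so the purge at the end of the pass cannot delete them.
        self.seen = {u for u, (b, _) in self.docs.items() if b != backend}

    def need_update(self, udi, sig):
        stored = self.docs.get(udi)
        if stored is not None and stored[1] == sig:
            self.seen.add(udi)  # side effect: unchanged document marked present
            return False
        return True

    def add_or_update(self, udi, backend, sig):
        self.docs[udi] = (backend, sig)
        self.seen.add(udi)

    def purge(self):
        # Delete every document that was not seen during the pass.
        gone = sorted(u for u in self.docs if u not in self.seen)
        for udi in gone:
            del self.docs[udi]
        return gone


def run_pass(index, backend, dataset):
    """Run one indexing pass. dataset maps udi -> sig.

    Returns (list of updated udis, list of purged udis).
    """
    index.prepare_purge(backend)
    updated = []
    for udi, sig in dataset.items():
        if index.need_update(udi, sig):
            index.add_or_update(udi, backend, sig)
            updated.append(udi)
    return updated, index.purge()


if __name__ == "__main__":
    index = ToyIndex()
    index.add_or_update("other-doc", "FS", "1")  # owned by another backend
    print(run_pass(index, "MYBACK", {"a": "s1", "b": "s1"}))  # (['a', 'b'], [])
    print(run_pass(index, "MYBACK", {"a": "s2"}))             # (['a'], ['b'])
    print(sorted(index.docs))                                 # ['a', 'other-doc']
```

In the second pass, "a" is reindexed because its signature changed, "b" is purged because it was not seen, and the document owned by the other backend survives.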

The Recoll source tree has two samples of external indexers.

  • rclmbox.py indexes a directory containing mbox folder files. It is not really useful in practice because Recoll can do this by itself, but it exercises most features of the update interface, and because it has both top-level and embedded documents, it demonstrates the use of ipath values.
  • rcljoplin.py indexes the main notes SQL table of a Joplin application. Joplin sets an update date attribute for each record in the table, so each note record can be processed as a standalone document (no ipath necessary). The sample has full preview and open support (the latter using a Joplin callback URL which allows displaying the result note inside the native app), so it could actually be useful for performing a unified search of the Joplin data and the regular Recoll data. As of Recoll 1.37.0, the Joplin indexer is part of the default installation (see the features section of the Web site for more information).
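
To illustrate the difference between top-level and embedded documents, here is a minimal sketch which walks an mbox folder with the standard Python mailbox module. It is not taken from rclmbox.py. The function names, the "path|ipath" udi layout and the "size:mtime" signature format are assumptions chosen for the example, and the message key is simply used as the ipath.

```python
import mailbox
import os
import tempfile


def file_sig(path):
    """Signature built from file size and modification time (assumed format)."""
    st = os.stat(path)
    return f"{st.st_size}:{st.st_mtime_ns}"


def mbox_entries(path):
    """Yield (udi, ipath, subject) for the folder file and each message in it."""
    # Top-level document: the folder file itself, with an empty ipath.
    yield f"{path}|", "", None
    box = mailbox.mbox(path, create=False)
    try:
        for key in box.iterkeys():
            # Embedded document: the ipath locates the message inside the file.
            ipath = str(key)
            yield f"{path}|{ipath}", ipath, box[key]["Subject"]
    finally:
        box.close()


def make_sample_mbox(path):
    """Create a small folder with two messages, for demonstration only."""
    box = mailbox.mbox(path)
    for subject in ("first", "second"):
        msg = mailbox.mboxMessage()
        msg["From"] = "someone@example.com"
        msg["Subject"] = subject
        msg.set_payload("body of " + subject + "\n")
        box.add(msg)
    box.flush()
    box.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        folder = os.path.join(tmpdir, "inbox")
        make_sample_mbox(folder)
        print("sig:", file_sig(folder))
        for udi, ipath, subject in mbox_entries(folder):
            print(repr(ipath), subject, udi)
```

A real indexer would pass each udi to the update check and, for embedded documents, store the ipath so that the message can be extracted again for previewing.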

See the comments inside the scripts for more information.
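
For a record-based source such as the notes table described above, the update date can serve directly as the signature. The sketch below shows the idea with the standard sqlite3 module. The table layout is a simplified stand-in, not the actual Joplin schema, the "joplin:" udi prefix is hypothetical, and the indexed_sigs dictionary merely plays the role of the index update check.

```python
import sqlite3


def changed_notes(conn, indexed_sigs):
    """Yield (udi, sig, title, body) for notes which are new or modified.

    indexed_sigs maps udi -> signature recorded during the previous pass.
    """
    rows = conn.execute(
        "SELECT id, title, body, updated_time FROM notes ORDER BY id")
    for note_id, title, body, updated_time in rows:
        udi = f"joplin:{note_id}"   # one standalone document per record
        sig = str(updated_time)     # the update date is the signature
        if indexed_sigs.get(udi) != sig:
            yield udi, sig, title, body


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (id TEXT PRIMARY KEY, title TEXT, "
                 "body TEXT, updated_time INTEGER)")
    conn.executemany("INSERT INTO notes VALUES (?, ?, ?, ?)",
                     [("n1", "Groceries", "milk", 100),
                      ("n2", "Ideas", "index my notes", 200)])

    # First pass: nothing is indexed yet, so every note is returned.
    first = list(changed_notes(conn, {}))
    indexed = {udi: sig for udi, sig, _, _ in first}
    print([udi for udi, _, _, _ in first])      # ['joplin:n1', 'joplin:n2']

    # Modify one note, then run a second pass: only that note is returned.
    conn.execute("UPDATE notes SET body = 'index all my notes', "
                 "updated_time = 300 WHERE id = 'n2'")
    print([udi for udi, _, _, _ in changed_notes(conn, indexed)])  # ['joplin:n2']
```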