There are two parts in the indexing interface:
- Methods inside the recoll module allow the foreign indexer to update the index.
- An interface based on script execution is defined for executing the indexer (from recollindex) and for allowing either the GUI or the rclextract module to access original document data for previewing or editing.
Two sample scripts are included with the Recoll source and are described in more detail further down.
The update methods are part of the recoll module. The connect() method is used with a writable=True parameter to obtain a writable Db object. The following Db object methods are then available.
Note that the changes are only guaranteed to be flushed to the index when db.close() is called. This normally occurs when the program exits, but it is much safer to use an explicit call after making the changes.
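As a minimal sketch of the open, update, close pattern described above, the helper below wraps the connection in a context manager so that close() always runs, even if the update code raises. The helper name writable_index, the `from recoll import recoll` import path and the confdir keyword are assumptions made for illustration; the connect callable can be replaced so that the pattern can be tried without an installed index.

```python
from contextlib import contextmanager


@contextmanager
def writable_index(confdir=None, connect=None):
    """Yield a writable Db object and always close it on exit.

    `writable_index` is a hypothetical helper, not part of Recoll.
    `connect` defaults to recoll.connect; passing another callable
    makes it possible to exercise the pattern without a real index.
    """
    if connect is None:
        # Assumed import path for the Recoll Python module.
        from recoll import recoll
        connect = recoll.connect
    kwargs = {"writable": True}
    if confdir is not None:
        kwargs["confdir"] = confdir  # assumed keyword name
    db = connect(**kwargs)
    try:
        yield db
    finally:
        # Changes are only guaranteed to be flushed once close() is called.
        db.close()
```

An indexer would then perform all its updates inside a `with writable_index(confdir) as db:` block instead of relying on program exit to flush the index.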
- addOrUpdate(udi, doc, parent_udi=None, metaonly=False)
Add or update index data for a given document.
The udi string must define a unique id for the document. It is an opaque interface element and is not interpreted inside the lower-level Recoll code.
doc is a Doc object, containing the data to be indexed.
If parent_udi is set, this is a unique identifier for the top-level container, the document for which needUpdate() would be called (e.g. for the filesystem indexer, this would be the one which is an actual file).
If metaonly is set, the main document text (in the doc.text attribute) will be ignored and only the other metadata fields present in the doc object will be processed. This can be useful for updating the metadata apart from the main text.
Document attributes: doc.text should have the main text. It is ignored if metaonly is set. For actual indexing (metaonly not set), the url and possibly ipath fields should also be set to allow access to the actual document after a query. Other fields may also need to be set by an external indexer (see the description further down): rclbes, sig, mimetype. Of course, any standard or custom Recoll field can also be added.
- delete(udi)
Purge from the index all data for udi, and for all documents (if any) which have udi as parent_udi.
- needUpdate(udi, sig)
Test if the index needs to be updated for the document identified by udi. If this call is to be used, the doc.sig field should contain a signature value when calling addOrUpdate(). The needUpdate() call then compares its parameter value with the stored sig for udi. sig is an opaque value, compared as a string.
The filesystem indexer uses a concatenation of the decimal string values for file size and update time, but a hash of the contents could also be used.
As a side effect, if the return value is false (the index is up to date), the call will set the existence flag for the document (and any subdocument defined by its parent_udi), so that a later purge() call will preserve them.
The use of needUpdate() and purge() is optional, and the indexer may use another method for checking the need to reindex or to delete stale entries.
- preparePurge(backend_name)
Mark all documents which do *not* belong to backend_name as existing. backend_name is the value chosen for the rclbes field for the indexer documents (e.g. "MBOX", "JOPLIN"... for the samples). This is a mandatory call before starting an update if the index is shared with other backends and you are going to call purge() after the update; otherwise, all documents for other backends will be deleted from the index by the purge.
- purge()
Delete all documents that were not touched during the just finished indexing pass (since preparePurge()). These are the documents for which the needUpdate() call was not performed, indicating that they no longer exist in the storage system.
- createStemDbs(lang|sequence of langs)
Create stemming dictionaries for query stemming expansion. This should be called when done updating the index. Note that it is not needed at all if the indexing is done from the recollindex program, as it will perform this action after calling all the external indexers. Available only after Recoll 1.34.3. As an alternative, you can close the index and execute:
recollindex -c <confdir> -s <lang(s)>
The Python module currently has no interface to the Aspell speller functions, so the same approach can be used for creating the spelling dictionary (with option -S). Again, this is not needed if recollindex is driving the indexing.
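The methods above can be combined into a single update pass, sketched below under stated assumptions. The function name run_index_pass, the layout of the items dictionaries and the new_doc factory (a callable returning an empty Doc object) are hypothetical; only the Db method names and the Doc field names come from the descriptions above.

```python
def run_index_pass(db, items, backend_name, new_doc, stem_langs=None):
    """Run one update pass for a single backend; return the updated udis.

    Hypothetical helper. `items` is an iterable of dicts with the keys
    udi, sig, url, mimetype, text and optionally ipath. `new_doc` is a
    callable returning an empty Doc object.
    """
    updated = []
    # Protect documents belonging to other backends from the final purge().
    db.preparePurge(backend_name)
    for item in items:
        # A false return means "up to date"; the call also flags the
        # document as existing, so that purge() will keep it.
        if not db.needUpdate(item["udi"], item["sig"]):
            continue
        doc = new_doc()
        doc.text = item["text"]
        doc.url = item["url"]
        if item.get("ipath"):
            doc.ipath = item["ipath"]
        doc.mimetype = item["mimetype"]
        doc.rclbes = backend_name
        doc.sig = item["sig"]  # stored for later needUpdate() comparisons
        db.addOrUpdate(item["udi"], doc)
        updated.append(item["udi"])
    # Delete the documents that were not seen during this pass.
    db.purge()
    if stem_langs and hasattr(db, "createStemDbs"):
        # Not needed when recollindex drives the indexing.
        db.createStemDbs(stem_langs)
    return updated
```

Calling preparePurge() first and purge() last keeps the pass safe on an index shared with other backends, while needUpdate() avoids reprocessing documents whose signature has not changed.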
Recoll has internal methods to access document data for its internal (filesystem) indexer. An external indexer needs to provide data access methods if it needs integration with the GUI (e.g. preview function), or support for the rclextract module.
An external indexer needs to provide two commands, for fetching data (typically for previewing) and for computing the document signature (for up-to-date checks when opening or previewing). The sample MBOX and JOPLIN implementations use the same script with different parameters to perform both operations, but this is just a choice. A third command must be provided for performing the indexing proper.
The "fetch" and "makesig" scripts are called with three additional arguments: udi, url, ipath. These were set by the indexer and stored with the document by the addOrUpdate() call described above. Not all arguments are needed in all cases: the script will use what it needs to perform the requested operation. The caller expects the result data on stdout.
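The following is a rough sketch of a single script serving both operations, in the spirit of the samples. It is not the code of the MBOX or JOPLIN scripts. It assumes that the operation name ("fetch" or "makesig") is passed as the first parameter, ahead of udi, url and ipath, and that each document is a plain file designated by a file:// url; both are illustrative assumptions. The signature imitates the size and update time concatenation mentioned above for the filesystem indexer.

```python
import os
import sys
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_signature(path):
    """Return file size and modification time, concatenated as decimal strings."""
    st = os.stat(path)
    return f"{st.st_size}{int(st.st_mtime)}"


def main(argv, out=None):
    """Handle: <script> fetch|makesig <udi> <url> <ipath> (assumed layout)."""
    out = out if out is not None else sys.stdout.buffer
    if len(argv) != 5 or argv[1] not in ("fetch", "makesig"):
        return 1
    operation, udi, url, ipath = argv[1:5]
    # This sketch only needs the url; udi and ipath are accepted but unused.
    path = url2pathname(urlparse(url).path)
    if operation == "makesig":
        out.write(file_signature(path).encode("utf-8"))
    else:
        # fetch: write the raw document data to stdout.
        with open(path, "rb") as f:
            out.write(f.read())
    return 0
```

A real script would end with `sys.exit(main(sys.argv))`, and a backend storing several documents per container would also use ipath to select the right one.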
recollindex will set the RECOLL_CONFDIR environment variable when executing the scripts, so that the configuration can be created as
rclconf = rclconfig.RclConfig()
if needed, and the configuration directory obtained as
confdir = rclconf.getConfDir()