Interface elements

A few elements in the interface are specific and need an explanation.

ipath

An ipath identifies an embedded document inside a standalone one (designated by a URL). The value, if needed, is stored along with the URL, but not indexed. It can be read or set as a field of the Doc object.

ipaths are opaque values for the lower index layers (Doc object producers or consumers), and their use is up to the specific indexer. For example, the Recoll file system indexer uses the ipath to store the part of the document access path internal to (possibly nested) container documents. In this case, the ipath is a vector of access elements: for example, the first element could be a path inside a zip file to an archive member which happens to be an mbox file, the second element would be the sequential number of the message inside the mbox, and so on. The index itself has no knowledge of this hierarchical structure.

At the moment, only the filesystem indexer uses hierarchical ipaths (neither the Web indexer nor the Joplin one does), and the upper software layers make some assumptions about their structure. For example, the Recoll GUI relies on the structure of filesystem indexer ipaths for functions such as opening the immediate parent of a given document.
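
As an illustration of the hierarchical ipath described above, the following minimal sketch models an ipath as a list of access elements and finds the immediate parent by dropping the last element. The separator character, the helper names, and the sample values are assumptions made for this example; they are not taken from the Recoll sources.

```python
# Hypothetical separator between ipath elements (an assumption for this sketch).
IPATH_SEP = "|"


def split_ipath(ipath):
    """Return the list of access elements ([] for a top-level document)."""
    return ipath.split(IPATH_SEP) if ipath else []


def parent_ipath(ipath):
    """Return the ipath of the immediate parent.

    The result is '' when the parent is the top-level file itself.
    """
    elements = split_ipath(ipath)
    return IPATH_SEP.join(elements[:-1])


# An mbox file stored inside a zip archive, then message number 3 in that mbox.
message_ipath = IPATH_SEP.join(["mail/archive.mbox", "3"])

print(split_ipath(message_ipath))
print(repr(parent_ipath(message_ipath)))
print(repr(parent_ipath("mail/archive.mbox")))
```

The index would only ever see the joined string; only code that knows the convention (here, the two helpers) can interpret it as a hierarchy.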

url and ipath are returned in every search result and define the access to the original document. ipath is empty for top-level documents/files (e.g. a PDF document which is a filesystem file).

udi

A udi (unique document identifier) identifies a document. Because of limitations inside the index engine, it is restricted in length (to 200 bytes). The structure and contents of the udi are defined by the application and are opaque to the index engine. For example, the internal file system indexer uses the complete document path (file path + internal path), truncated to a maximum length, with the suppressed part replaced by a hash value to retain practical uniqueness.

To rephrase, and hopefully clarify: the filesystem indexer can't use the URL+ipath as a unique document-identifying term because this may be too big: it derives a shorter udi from URL+ipath. Another indexer could use a completely different method. For example, the Joplin indexer uses the note ID.
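
The derivation of a bounded-length udi from URL+ipath can be sketched as follows. This is only an illustration of the truncate-and-hash idea: the function name, the separator, and the choice of SHA-256 are assumptions for this example, and the actual Recoll filesystem indexer may compute its udi differently.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the text


def make_udi(url, ipath, max_bytes=UDI_MAX_BYTES):
    """Build a udi of at most max_bytes bytes from url and ipath.

    Hypothetical scheme: keep the head of the full path and replace the
    suppressed tail with a hash of that tail.
    """
    full = (url + "|" + ipath).encode("utf-8")
    if len(full) <= max_bytes:
        return full
    digest_len = hashlib.sha256().digest_size * 2  # length of the hex digest
    keep = max_bytes - digest_len
    digest = hashlib.sha256(full[keep:]).hexdigest().encode("ascii")
    return full[:keep] + digest


short_udi = make_udi("file:///home/me/archive.zip", "mail/archive.mbox|3")
deep_url = "file:///" + "dir/" * 100 + "archive.zip"
long_udi = make_udi(deep_url, "mail/archive.mbox|3")
print(len(short_udi), len(long_udi))
```

Two long paths that differ only in the truncated part still yield different udis, because the hash of the suppressed tail differs.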

parent_udi

If this attribute is set on a document when entering it in the index, it designates its physical container document. In a multilevel hierarchy, this may not be the immediate parent. If the indexer uses the purge() method, then the use of parent_udi is mandatory for subdocuments. Otherwise it is optional, but its use by an indexer may simplify index maintenance, as Recoll will automatically delete all children defined by parent_udi == udi when the document designated by udi is destroyed. For example, if a Zip archive contains entries which are themselves containers, like mbox files, all the subdocuments inside the Zip file (mbox, messages, message attachments, etc.) would have the same parent_udi, matching the udi for the Zip file, and all would be destroyed when the Zip file (identified by its udi) is removed from the index.
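
The cascading deletion described above can be modelled with a small in-memory stand-in for the index. The ToyIndex class and the udi strings are hypothetical and exist only for this sketch; this is not the Recoll API, it only mirrors the rule that every document whose parent_udi equals the deleted udi goes away with it.

```python
class ToyIndex:
    """In-memory stand-in for an index (hypothetical, not the Recoll API)."""

    def __init__(self):
        # Maps udi -> parent_udi (None for a top-level document).
        self.docs = {}

    def add(self, udi, parent_udi=None):
        self.docs[udi] = parent_udi

    def delete(self, udi):
        """Remove udi and every document whose parent_udi equals udi."""
        self.docs = {
            u: p for u, p in self.docs.items() if u != udi and p != udi
        }


index = ToyIndex()
index.add("zip")
# Every subdocument names the Zip file as its parent, whatever its depth.
index.add("zip|mbox", parent_udi="zip")
index.add("zip|mbox|1", parent_udi="zip")
index.add("zip|mbox|1|attachment", parent_udi="zip")
index.add("other.pdf")

index.delete("zip")
print(sorted(index.docs))
```

Because parent_udi points at the physical container rather than the immediate parent, a single equality test is enough to find all descendants, with no need to walk the hierarchy.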

Stored and indexed fields

The fields file inside the Recoll configuration defines which document fields are either indexed (searchable), stored (retrievable with search results), or both. Apart from a few standard/internal fields, only the stored fields are retrievable through the Python search interface.
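
The distinction between indexed and stored fields can be pictured with the following sketch. The two sets and the field names in them are hypothetical stand-ins for what a fields file would declare, and the sketch ignores the few standard/internal fields mentioned above.

```python
# Hypothetical field declarations (assumptions for this sketch).
INDEXED_FIELDS = {"title", "author", "keywords"}
STORED_FIELDS = {"title", "author", "recipient"}


def field_status(name):
    """Describe how a field can be used, given the two sets above."""
    indexed = name in INDEXED_FIELDS
    stored = name in STORED_FIELDS
    if indexed and stored:
        return "searchable and retrievable"
    if indexed:
        return "searchable only"
    if stored:
        return "retrievable only"
    return "neither"


def result_fields(doc_fields):
    """Keep only the fields that would come back with a search result."""
    return {k: v for k, v in doc_fields.items() if k in STORED_FIELDS}


doc = {"title": "Report", "keywords": "budget", "recipient": "me@example.com"}
print(field_status("keywords"))
print(result_fields(doc))
```

In this model a query could match on keywords, but the value of keywords would not be available from the result, since the field is indexed without being stored.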