A few elements in the interface are specific and need an explanation.
- ipath
An ipath identifies an embedded document inside a standalone one (designated by a URL). The value, if needed, is stored along with the URL, but not indexed. It is accessible or set as a field in the Doc object.
ipaths are opaque values for the lower index layers (Doc object producers or consumers), and their use is up to the specific indexer. For example, the Recoll file system indexer uses the ipath to store the part of the document access path internal to (possibly nested) container documents. The ipath in this case is a vector of access elements (e.g. the first element could be a path inside a zip file to an archive member which happens to be an mbox file, the second element would be the sequential number of the message inside the mbox, etc.). The index itself has no knowledge of this hierarchical structure.
At the moment, only the filesystem indexer uses hierarchical ipaths (neither the Web nor the Joplin one does), and there are some assumptions in the upper software layers about their structure. For example, the Recoll GUI knows how to use a filesystem indexer ipath for such functions as opening the immediate parent of a given document.
url and ipath are returned in every search result and define the access to the original document. ipath is empty for top-level documents/files (e.g. a PDF document which is a filesystem file).
- udi
A udi (unique document identifier) identifies a document. Because of limitations inside the index engine, it is restricted in length (to 200 bytes). The structure and contents of the udi are defined by the application and are opaque to the index engine. For example, the internal file system indexer uses the complete document path (file path + internal path), truncated to a maximum length, the suppressed part being replaced by a hash value to retain practical uniqueness.
To rephrase, and hopefully clarify: the filesystem indexer can't use the URL+ipath as a unique document-identifying term because this may be too big, so it derives a shorter udi from URL+ipath. Another indexer could use a completely different method. For example, the Joplin indexer uses the note ID.
- parent_udi
If this attribute is set on a document when entering it in the index, it designates its physical container document. In a multilevel hierarchy, this may not be the immediate parent. If the indexer uses the purge() method, then the use of parent_udi is mandatory for subdocuments. Otherwise it is optional, but its use by an indexer may simplify index maintenance, as Recoll will automatically delete all children defined by parent_udi == udi when the document designated by udi is destroyed. For example, if a Zip archive contains entries which are themselves containers, like mbox files, all the subdocuments inside the Zip file (mbox, messages, message attachments, etc.) would have the same parent_udi, matching the udi for the Zip file, and all would be destroyed when the Zip file (identified by its udi) is removed from the index.
- Stored and indexed fields
The fields file inside the Recoll configuration defines which document fields are either indexed (searchable), stored (retrievable with search results), or both. Apart from a few standard/internal fields, only the stored fields are retrievable through the Python search interface.
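
The udi and parent_udi descriptions above can be illustrated with a small, self-contained Python sketch. It does not use the Recoll API and does not reproduce the actual scheme of the Recoll filesystem indexer: the make_udi function, the "|" separator, the choice of SHA-256 and the ToyIndex class are all hypothetical names and choices made for illustration. Only the 200 byte limit, the truncate-and-hash idea and the deletion of all documents sharing a parent_udi come from the text.

```python
import hashlib

MAX_UDI_BYTES = 200  # length limit mentioned in the text


def make_udi(url, ipath="", max_bytes=MAX_UDI_BYTES):
    """Derive a bounded-length identifier from url + ipath.

    Hypothetical scheme: when the full value is too long, keep its
    beginning and replace the suppressed tail with a hash of the
    whole value, so that distinct documents keep distinct udis in practice.
    """
    full = f"{url}|{ipath}".encode("utf-8")
    if len(full) <= max_bytes:
        return full.decode("utf-8")
    digest = hashlib.sha256(full).hexdigest()[:32]  # 32 ASCII bytes
    # errors="ignore" drops a multi-byte character cut by the truncation
    kept = full[: max_bytes - len(digest)].decode("utf-8", errors="ignore")
    return kept + digest


class ToyIndex:
    """In-memory stand-in for an index: maps udi -> parent_udi (or None)."""

    def __init__(self):
        self.docs = {}

    def add(self, udi, parent_udi=None):
        self.docs[udi] = parent_udi

    def delete(self, udi):
        """Remove a document and every document whose parent_udi matches it."""
        self.docs.pop(udi, None)
        children = [u for u, parent in self.docs.items() if parent == udi]
        for child in children:
            del self.docs[child]


# A Zip file containing an mbox, which contains a message. All
# subdocuments point at the physical container (the Zip file),
# not at their immediate parent.
url = "file:///home/me/archive.zip"
zip_udi = make_udi(url)
mbox_udi = make_udi(url, "mail/inbox.mbox")
msg_udi = make_udi(url, "mail/inbox.mbox|3")

index = ToyIndex()
index.add(zip_udi)
index.add(mbox_udi, parent_udi=zip_udi)
index.add(msg_udi, parent_udi=zip_udi)
index.add(make_udi("file:///home/me/notes.pdf"))

index.delete(zip_udi)  # also removes the mbox and the message
print(len(index.docs))
```

Because every subdocument carries the udi of the top-level container, a single pass over the index is enough to remove the whole hierarchy, with no need to walk it level by level.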