Prior to Recoll 1.25, index queries could not provide document content because it was
never stored. Recoll 1.25 and later usually store the document text, which can be
optionally retrieved when running a query (see query.execute() above; the result is always
plain text).
Independently, the rclextract module can give access to the original document and to the
document text content, possibly as an HTML version. Accessing the original document is
particularly useful if it is embedded (e.g. an email attachment).
You need to import the recoll module before the rclextract module.
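The first option mentioned above (retrieving the stored text along with the query) could look roughly like the sketch below. It assumes the fetchtext keyword argument of query.execute() and the rowcount attribute of the query object, and that the index actually stores the document text. The function name first_hit_text is purely illustrative, and db is expected to be a database object such as the one returned by recoll.connect().

```python
def first_hit_text(db, query_string):
    """Return the stored plain text of the first hit, or None.

    None is returned when the query has no results or when no text
    was stored for the document (e.g. an index built before 1.25).
    """
    query = db.query()
    # Assumption: fetchtext=True asks Recoll to load the stored text
    # into doc.text for each result.
    query.execute(query_string, fetchtext=True)
    if query.rowcount == 0:
        return None
    doc = query.fetchone()
    # The stored text is always plain text, whatever the original format.
    return doc.text or None
```

No rclextract import is needed for this path, since the text comes straight from the index rather than from the original document.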
- Extractor(doc)
An Extractor object is built from a Doc object, output from a query.
- Extractor.textextract(ipath)
Extract the document defined by ipath and return a Doc object. The doc.text field has the document text converted to either text/plain or text/html according to doc.mimetype. The typical use would be as follows:

    from recoll import recoll, rclextract
    qdoc = query.fetchone()
    extractor = rclextract.Extractor(qdoc)
    doc = extractor.textextract(qdoc.ipath)
    # use doc.text, e.g. for previewing

Passing qdoc.ipath to textextract() is redundant, but reflects the fact that the Extractor object actually has the capability to access the other entries in a compound document.
- Extractor.idoctofile(ipath, targetmtype, outfile='')
Extracts the document into an output file. The file name can be given explicitly; otherwise a temporary file is created, which the caller must delete. Typical use:

    from recoll import recoll, rclextract
    qdoc = query.fetchone()
    extractor = rclextract.Extractor(qdoc)
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)

In all cases the output is a copy, even if the requested document is a regular system file, which may be wasteful in some cases. If you want to avoid this, you can test for a simple file document as follows:

    not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")
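
One way to put this test to use is sketched below: the original file is used directly when the document is a simple file, and idoctofile() is called only otherwise. The helper names is_simple_file and path_for are hypothetical, not part of rclextract. The sketch also assumes that doc.url is a plain file:// URL that maps directly to a local path, which may not hold for unusual file names.

```python
def is_simple_file(doc):
    """True if doc is a regular file-system file, not an embedded entry."""
    return not doc.ipath and (
        "rclbes" not in doc.keys() or doc["rclbes"] == "FS"
    )


def path_for(doc, extractor):
    """Return (path, is_temporary) for accessing the document data.

    When is_temporary is True, the file is a copy created by
    idoctofile() and the caller is responsible for deleting it.
    """
    if is_simple_file(doc):
        # Assumption: the URL is "file://" followed by the local path.
        return doc.url[len("file://"):], False
    # Embedded or non-file-system document: have the extractor copy it.
    return extractor.idoctofile(doc.ipath, doc.mimetype), True
```

Returning the is_temporary flag alongside the path keeps the cleanup decision with the caller, which must not delete the original when no copy was made.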