The rclpdf.py
script in Recoll version
1.23.2 and later can extract XMP metadata fields by executing the
pdfinfo command (usually found with
poppler-utils). This is controlled by
the
pdfextrameta
configuration variable, which specifies which tags to extract and,
possibly, how to rename them.
The pdfextrametafix variable can be used to designate a file with Python code to edit the metadata fields (available for Recoll 1.23.3 and later. 1.23.2 has equivalent code inside the handler script). Example:
import sys import re class MetaFixer(object): def __init__(self): pass def metafix(self, nm, txt): if nm == 'bibtex:pages': txt = re.sub(r'--', '-', txt) elif nm == 'someothername': # do something else pass elif nm == 'stillanother': # etc. pass return txt def wrapup(self, metaheaders): pass
If the 'metafix()' method is defined, it is called for each metadata field. A new MetaFixer object is created for each PDF document (so the object can keep state for, for example, eliminating duplicate values). If the 'wrapup()' method is defined, it is called at the end of XMP fields processing with the whole metadata as parameter, as an array of '(nm, val)' pairs, allowing an alternate approach for editing or adding/deleting fields.
See this page for a more detailed discussion about indexing PDF XMP properties.