XMP fields extraction

The rclpdf.py script in Recoll 1.23.2 and later can extract XMP metadata fields by executing the pdfinfo command (usually provided by the poppler-utils package). Extraction is controlled by the pdfextrameta configuration variable, which specifies which tags to extract and, optionally, how to rename them.
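The following is a minimal sketch of the "select and rename" idea, not the actual rclpdf.py code. It assumes the XMP packet has already been obtained as an XML string (for instance from pdfinfo output), and it assumes a space-separated list of entries of the form tag or tag|newname; that value syntax, the sample namespace URI, and the names parse_spec, extract_fields and SAMPLE_XMP are all illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical XMP packet, standing in for what pdfinfo would return.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:bibtex="http://jabref.sourceforge.net/bibteXMP/">
   <bibtex:pages>12--34</bibtex:pages>
   <bibtex:booktitle>Some Proceedings</bibtex:booktitle>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def parse_spec(spec):
    """Split an assumed 'tag|newname tag2' value into (tag, newname) pairs.

    When no new name is given, the tag itself is kept as the field name
    (an assumption made for this sketch).
    """
    pairs = []
    for item in spec.split():
        tag, _, newname = item.partition('|')
        pairs.append((tag, newname or tag))
    return pairs


def extract_fields(xmp_xml, spec):
    """Return (fieldname, value) pairs for the tags listed in spec."""
    # Collect the prefix -> URI declarations while parsing the packet.
    nsmap = {}
    parser = ET.iterparse(io.BytesIO(xmp_xml.encode('utf-8')),
                          events=('start-ns',))
    for _, (prefix, uri) in parser:
        nsmap[prefix] = uri
    root = parser.root

    fields = []
    for tag, newname in parse_spec(spec):
        prefix, sep, _ = tag.partition(':')
        if not sep or prefix not in nsmap:
            # Tag without a prefix, or prefix not declared in this packet.
            continue
        for elt in root.findall('.//' + tag, {prefix: nsmap[prefix]}):
            # Join nested text and collapse runs of whitespace.
            value = ' '.join(''.join(elt.itertext()).split())
            fields.append((newname, value))
    return fields


if __name__ == '__main__':
    print(extract_fields(SAMPLE_XMP, 'bibtex:pages|pages bibtex:booktitle'))
```

With the sample packet, the first entry is returned under the new name pages and the second keeps its original tag name. Structured XMP values (bags, alternatives) are simply flattened to text here, which may not match what the real handler does.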

The pdfextrametafix variable can designate a file containing Python code that edits the metadata fields. It is available in Recoll 1.23.3 and later; version 1.23.2 has equivalent code inside the handler script. Example:

import sys
import re

class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt

    def wrapup(self, metaheaders):
        pass

If the 'metafix()' method is defined, it is called for each metadata field. A new MetaFixer object is created for each PDF document, so the object can keep per-document state (for example, to eliminate duplicate values). If the 'wrapup()' method is defined, it is called at the end of XMP field processing with all the metadata as its parameter, as a list of '(nm, val)' pairs. This allows an alternate approach for editing, adding, or deleting fields.
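As an illustration of the two hooks working together, here is a hedged sketch of a fixer that edits values in 'metafix()' and removes duplicate pairs in 'wrapup()'. It assumes that the handler keeps using the same list object after 'wrapup()' returns, so the list is modified in place; the text above does not confirm this. The simulate_handler function is a hypothetical stand-in for the handler's calling sequence, not Recoll code.

```python
import re


class MetaFixer(object):
    """Example fixer: normalizes page ranges and drops duplicate fields."""

    def __init__(self):
        # One instance is created per PDF document, so this counter is
        # per-document state.
        self.nfixed = 0

    def metafix(self, nm, txt):
        # Called once per extracted field; returns the edited value.
        if nm == 'bibtex:pages':
            newtxt = re.sub(r'--', '-', txt)
            if newtxt != txt:
                self.nfixed += 1
            txt = newtxt
        return txt.strip()

    def wrapup(self, metaheaders):
        # metaheaders is a list of (nm, val) pairs.
        # Assumption: the caller reads this same list afterwards, so the
        # duplicates are removed in place rather than by returning a copy.
        seen = set()
        unique = []
        for nm, val in metaheaders:
            if (nm, val) not in seen:
                seen.add((nm, val))
                unique.append((nm, val))
        metaheaders[:] = unique


def simulate_handler(fields):
    """Hypothetical driver mimicking the calls described above."""
    fixer = MetaFixer()
    metaheaders = [(nm, fixer.metafix(nm, txt)) for nm, txt in fields]
    fixer.wrapup(metaheaders)
    return metaheaders


if __name__ == '__main__':
    sample = [
        ('bibtex:pages', '12--34'),
        ('bibtex:booktitle', ' Proceedings '),
        ('bibtex:pages', '12-34'),
    ]
    print(simulate_handler(sample))
```

In this run the two page entries become identical once the first is normalized, so only one survives the 'wrapup()' pass. Doing the deduplication in 'wrapup()' rather than in 'metafix()' avoids having to decide what 'metafix()' should return for a value that is to be dropped.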

See this page for a more detailed discussion about indexing PDF XMP properties.