Generating a custom field and using it to sort results

This page shows how to generate a custom field from a Recoll filter and use it to sort results. The example comes from an actual user request: sorting results by PDF page count.

The details here are obsolete, because the PDF input handler is now a quite different Python program, but the general idea is still relevant.

The page count of a PDF file can be displayed with the pdfinfo command (part of the xpdf or poppler tools).

We first modify a copy of the rclpdf filter ('/usr/[local/]share/recoll/filters/rclpdf') to compute the PDF page count and output the value as an HTML meta field. This takes a not very interesting bit of shell/awk magic. Another approach would be to rewrite the rclpdf filter in your favorite scripting language (e.g. Perl, Python...), as all it does is execute pdftotext and pdfinfo and output HTML, nothing complicated. Here follows the rclpdf modification as a pseudo patch:

# compute the page count and format it so that it's alphabetically sortable
+set `pdfinfo "$infile" | grep '^Pages:'`
+pages=`printf "%04d" $2`
[skip...]
# Pass the page count value to awk
-awk 'BEGIN'\
+awk -v Pages="$pages" 'BEGIN'\
[skip...]
# Inside the awk program startup section: compute the "meta" field line
+  pagemeta = "<meta name=\"pdfpages\" content=\"" Pages "\">\n"
[skip...]
# Then print it as part of the header:
+    $0 =  part1 charsetmeta pagemeta part2
[skip...]
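
For readers who prefer the scripting-language route mentioned above, the page-count part could look roughly like the following Python sketch. It is only an illustration, not the actual Recoll handler: the function names (parse_page_count, pages_meta, pdf_pages_meta) are hypothetical, and it assumes that pdfinfo is on the PATH and prints a "Pages:" line, as the patch above also does.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Return the number found on the 'Pages:' line of pdfinfo output."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def pages_meta(page_count: int, width: int = 4) -> str:
    """Build the HTML meta line, zero-padded like printf "%04d" in the patch."""
    return f'<meta name="pdfpages" content="{page_count:0{width}d}">'


def pdf_pages_meta(path: str) -> str:
    """Run pdfinfo on a file (assumed to be installed) and return the meta line."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return pages_meta(parse_page_count(result.stdout))
```

A full replacement filter would still have to run pdftotext and wrap the text in an HTML document; this sketch only covers the extra meta field.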

You can execute your own version of rclpdf by modifying '~/.recoll/mimeconf':

[index]
application/pdf = exec /path/to/my/own/rclpdf

At this point, recollindex would receive and extract a pdfpages field, but it would not know what to do with it. We are going to tell it to store the value inside the document data record so that it can be displayed in the results and used for sorting. To do this, we modify the '~/.recoll/fields' file:

[stored]
pdfpages=

That’s it! After reindexing, you can display pdfpages in the result list (add a %(pdfpages) value to the paragraph format), display it in the result table (right-click the table header), and sort the results on page count (click the column header).
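
The zero-padding done by printf "%04d" in the patch matters for this sort. The short Python sketch below illustrates why, under the assumption (suggested by the "alphabetically sortable" comment in the patch) that the stored values are compared as text rather than as numbers. The sample page counts are made up.

```python
# Hypothetical page counts, as they would appear without padding.
raw = ["9", "10", "120", "2"]

# The same values formatted like printf "%04d".
padded = [f"{int(n):04d}" for n in raw]

# Text comparison puts "10" and "120" before "2" and "9".
sorted_raw = sorted(raw)

# With a fixed width, text order and numeric order agree.
sorted_padded = sorted(padded)

print(sorted_raw)
print(sorted_padded)
```

Note that a width of 4 only keeps the order correct up to 9999 pages; a longer document would need a wider format.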

Note that pdfpages has not been defined as searchable (this would not make much sense). To make it searchable, you’d have to define a prefix and add it to the [prefixes] section of the fields file:

[prefixes]
pdfpages = XYPDFP

Have a look at the comments inside the 'fields' file for more information.