Field data processing

Fields are named pieces of information in or about documents, like title, author, abstract.

Field values for a document can originate in several ways during indexing: they can be output by input handlers as meta fields in the HTML header section, extracted from file extended attributes, added as attributes of the Doc object when using the API, or synthesized internally by Recoll.
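As an illustration of the first case, the following minimal Python sketch collects meta fields from the header of an HTML document, using only the standard library. It is not Recoll's own code: the class and function names, the sample document, and the pagecount field are all hypothetical.

```python
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Skip meta elements without a name (for example http-equiv ones).
        if name and content is not None:
            # Several meta elements may share a name: keep every value.
            self.fields.setdefault(name.lower(), []).append(content)


def extract_meta_fields(html_text):
    """Return a dict mapping each meta field name to its list of values."""
    parser = MetaFieldParser()
    parser.feed(html_text)
    parser.close()
    return parser.fields


# Hypothetical handler output: field names and values are made up.
SAMPLE_HTML = """<html><head>
<title>Quarterly report</title>
<meta name="author" content="Jane Doe">
<meta name="pagecount" content="12">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body><p>Body text.</p></body></html>"""

fields = extract_meta_fields(SAMPLE_HTML)
```

Here fields maps each meta name to the list of its values, so a field that is repeated in the header keeps all of them.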

The Recoll query language allows searching for text in a specific field.

Recoll defines a number of default fields. Additional ones can be output by handlers, and described in the fields configuration file.

Fields can be:

  • indexed, meaning that their terms are separately stored in inverted lists (with a specific prefix), and that a field-specific search is possible.

  • stored, meaning that their value is recorded in the index data record for the document, and can be returned and displayed with search results.

A field can be indexed, stored, or both. This and other aspects of field handling are defined in the fields configuration file.

Some fields may also be designated as supporting range queries, meaning that results may be selected for an interval of their values. See the configuration section for more details.

The sequence of events for field processing is as follows:

  • During indexing, recollindex scans all meta fields in HTML documents (most document types are transformed into HTML at some point). It compares the name of each element to the configuration defining what should be done with fields (the fields file).

  • If the name for the meta element matches one for a field that should be indexed, the contents are processed and the terms are entered into the index with the prefix defined in the fields file.

  • If the name for the meta element matches one for a field that should be stored, the content of the element is stored with the document data record, from which it can be extracted and displayed at query time.

  • At query time, if a field search is performed, the index prefix is computed and the match is only performed against appropriately prefixed terms in the index.

  • At query time, the field can be displayed inside the result list by using the appropriate directive in the definition of the result list paragraph format. All fields are displayed on the fields screen of the preview window (which you can reach through the right-click menu). This does not depend on whether the search that produced the results used the field.
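The sequence above can be sketched with a toy in-memory index in Python. This is a simplified model written for illustration, not Recoll's implementation: the ToyIndex class, the configuration dictionary, the prefixes, and the whitespace term splitting are all assumptions, and silently ignoring unknown fields is a choice made for this sketch only.

```python
from collections import defaultdict

# Hypothetical field configuration: names, prefixes and flags are
# illustrative and do not reproduce Recoll's actual fields file.
FIELD_CONFIG = {
    "author": {"prefix": "A", "stored": True},       # indexed and stored
    "keywords": {"prefix": "K", "stored": False},    # indexed only
    "pagecount": {"prefix": None, "stored": True},   # stored only
}


class ToyIndex:
    def __init__(self, config):
        self.config = config
        self.postings = defaultdict(set)  # prefixed term -> document ids
        self.records = {}                 # document id -> stored values

    def add_document(self, doc_id, meta_fields):
        """Index and/or store each meta field according to the config."""
        record = {}
        for name, value in meta_fields.items():
            conf = self.config.get(name)
            if conf is None:
                continue  # unknown field: ignored in this sketch
            if conf["prefix"]:
                # Indexed field: enter each term with the field prefix.
                for term in value.lower().split():
                    self.postings[conf["prefix"] + term].add(doc_id)
            if conf["stored"]:
                # Stored field: keep the raw value in the data record.
                record[name] = value
        self.records[doc_id] = record

    def field_search(self, name, word):
        """Match only against terms carrying the prefix of this field."""
        conf = self.config.get(name)
        if conf is None or not conf["prefix"]:
            return set()
        return set(self.postings.get(conf["prefix"] + word.lower(), set()))

    def stored_fields(self, doc_id):
        """Return the stored values, available for display with results."""
        return dict(self.records.get(doc_id, {}))


index = ToyIndex(FIELD_CONFIG)
index.add_document(1, {"author": "Jane Doe", "keywords": "budget finance",
                       "pagecount": "12"})
index.add_document(2, {"author": "John Smith", "keywords": "jane",
                       "pagecount": "3"})

hits = index.field_search("author", "jane")
```

In this model the author search only matches document 1, although "jane" also appears as a keyword of document 2, because the two terms carry different prefixes. The pagecount value cannot be searched, since it has no prefix, but it is still available through stored_fields for display.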

You can find more information in the section about the fields file, or in comments inside the file.

You can also have a look at the example in the FAQs area, which details how to add a page count field to PDF documents for display inside result lists.