Input handler output

Both the simple and persistent input handlers can return any MIME type to Recoll, which will further process the data according to the MIME configuration.

Most input filters filters produce either text/plain or text/html data. There are exceptions, for example, filters which process archive file (zip, tar, etc.) will usually return the documents as they are found, without processing them further.

There is nothing to say about text/plain output, except that its character encoding should be consistent with what is specified in the mimeconf file.

For filters producing HTML, the output could be very minimal like the following example:

          <html>
            <head>
              <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
            </head>
            <body>
              Some text content
            </body>
          </html>
          

You should take care to escape some characters inside the text by transforming them into appropriate entities. At the very minimum, "&" should be transformed into "&amp;", "<" should be transformed into "&lt;". This is not always properly done by external helper programs which output HTML, and of course never by those which output plain text.

When encapsulating plain text in an HTML body, the display of a preview may be improved by enclosing the text inside <pre> tags.

The character set needs to be specified in the header. It does not need to be UTF-8 (Recoll will take care of translating it), but it must be accurate for good results.

Recoll will process meta tags inside the header as possible document fields candidates. Documents fields can be processed by the indexer in different ways, for searching or displaying inside query results. This is described in a following section.

By default, the indexer will process the standard header fields if they are present: title, meta/description, and meta/keywords are both indexed and stored for query-time display.

A predefined non-standard meta tag will also be processed by Recoll without further configuration: if a date tag is present and has the right format, it will be used as the document date (for display and sorting), in preference to the file modification date. The date format should be as follows:

          <meta name="date" content="YYYY-mm-dd HH:MM:SS">
          or
          <meta name="date" content="YYYY-mm-ddTHH:MM:SS">
        

Example:

          <meta name="date" content="2013-02-24 17:50:00">
        

Input handlers also have the possibility to "invent" field names. This should also be output as meta tags:

          <meta name="somefield" content="Some textual data" />
        

You can embed HTML markup inside the content of custom fields, for improving the display inside result lists. In this case, add a (wildly non-standard) markup attribute to tell Recoll that the value is HTML and should not be escaped for display.

          <meta name="somefield" markup="html" content="Some <i>textual</i> data" />
        

As written above, the processing of fields is described in a further section.

Persistent filters can use another, probably simpler, method to produce metadata, by calling the setfield() helper method. This avoids the necessity to produce HTML, and any issue with HTML quoting. See, for example, rclaudio.py in Recoll 1.23 and later for an example of handler which outputs text/plain and uses setfield() to produce metadata.