Both the simple and persistent input handlers can return any MIME type to Recoll, which will further process the data according to the MIME configuration.
Most input filters produce either text/plain or text/html data. There are exceptions: for example, filters which process archive files (zip, tar, etc.) will usually return the documents as they are found, without processing them further.
There is nothing to say about text/plain output, except that its character encoding should be consistent with what is specified in the mimeconf file.
For filters producing HTML, the output could be very minimal like the following example:
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
  </head>
  <body>
    Some text content
  </body>
</html>
You should take care to escape some characters inside the text by transforming them into the appropriate entities. At the very minimum, "&" should be transformed into "&amp;", and "<" should be transformed into "&lt;". This is not always properly done by external helper programs which output HTML, and of course never by those which output plain text.
When encapsulating plain text in an HTML body, the display of a preview may be improved by enclosing the text inside <pre> tags.
The character set needs to be specified in the header. It does not need to be UTF-8 (Recoll will take care of translating it), but it must be accurate for good results.
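The points above (minimal document, entity escaping, <pre> wrapping, and a charset declaration) can be combined in a small helper. The following is only a sketch: the function name wrap_text_as_html and its parameters are made up for illustration and are not part of Recoll. It relies on html.escape() from the Python standard library.

```python
import html


def wrap_text_as_html(text, charset="UTF-8", preformatted=True):
    """Wrap plain text in a minimal HTML document (hypothetical helper)."""
    # html.escape turns "&" into "&amp;", "<" into "&lt;" and ">" into "&gt;".
    # quote=False leaves quotation marks alone, since this is body text.
    body = html.escape(text, quote=False)
    if preformatted:
        # <pre> keeps the original line breaks visible in a preview.
        body = "<pre>" + body + "</pre>"
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html;charset='
        + charset + '"/>'
        "</head><body>" + body + "</body></html>"
    )


if __name__ == "__main__":
    print(wrap_text_as_html("R&D budget < 10%"))
```

The charset argument must name the encoding in which the output bytes are actually written. The helper only declares it and does not perform any conversion.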
Recoll will process meta tags inside the header as candidate document fields. Document fields can be processed by the indexer in different ways, for searching or for displaying inside query results. This is described in a following section.
By default, the indexer will process the standard header fields if they are present: title, meta/description, and meta/keywords are all indexed, and stored for query-time display.
A predefined non-standard meta tag will also be processed by Recoll without further configuration: if a date tag is present and has the right format, it will be used as the document date (for display and sorting), in preference to the file modification date. The date format should be as follows:
<meta name="date" content="YYYY-mm-dd HH:MM:SS">
or
<meta name="date" content="YYYY-mm-ddTHH:MM:SS">
Example:
<meta name="date" content="2013-02-24 17:50:00">
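A handler written in Python could produce this tag with strftime(). This is a minimal sketch, and date_meta_tag is a hypothetical helper name, not a Recoll function. It emits the first of the two formats shown above (space separator).

```python
from datetime import datetime


def date_meta_tag(when):
    """Format a datetime as a date meta tag (hypothetical helper)."""
    # "%Y-%m-%d %H:%M:%S" matches the "YYYY-mm-dd HH:MM:SS" form.
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    return '<meta name="date" content="%s">' % stamp


if __name__ == "__main__":
    print(date_meta_tag(datetime(2013, 2, 24, 17, 50, 0)))
```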
Input handlers can also "invent" field names. These should also be output as meta tags:
<meta name="somefield" content="Some textual data" />
You can embed HTML markup inside the content of custom fields, to improve the display inside result lists. In this case, add a (wildly non-standard) markup attribute to tell Recoll that the value is HTML and should not be escaped for display.
<meta name="somefield" markup="html" content="Some <i>textual</i> data" />
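The two forms of custom field tag can be generated as follows. This is a sketch under stated assumptions: field_meta_tag is a hypothetical helper, field names are assumed to be simple identifiers that need no escaping, and HTML values are passed through unchanged as in the example above, so they are rejected if they contain a double quote that would end the attribute early.

```python
import html


def field_meta_tag(name, value, is_html=False):
    """Build a meta tag for a custom field (hypothetical helper)."""
    if is_html:
        # Assumption: the value is emitted as-is, as in the manual's example,
        # so a double quote would terminate the attribute prematurely.
        if '"' in value:
            raise ValueError("HTML field value must not contain a double quote")
        return '<meta name="%s" markup="html" content="%s" />' % (name, value)
    # Plain values: escape &, <, > and quotes so the attribute stays valid.
    return '<meta name="%s" content="%s" />' % (name, html.escape(value, quote=True))


if __name__ == "__main__":
    print(field_meta_tag("somefield", "Some textual data"))
    print(field_meta_tag("somefield", "Some <i>textual</i> data", is_html=True))
```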
As noted above, the processing of fields is described in a later section.
Persistent filters can use another, probably simpler, method to produce metadata, by calling the setfield() helper method. This avoids the need to produce HTML, and any issue with HTML quoting. See, for example, rclaudio.py in Recoll 1.23 and later for a handler which outputs text/plain and uses setfield() to produce metadata.