Simple input handlers

Recoll simple handlers are usually shell scripts, but this is by no means required. Extracting the text from the native format is the difficult part; outputting the format expected by Recoll is trivial. Fortunately, most document formats have translators or text extractors that can be called from the handler. In some cases the output of the translating program is entirely suitable as it stands, and no intermediate shell script is needed.

Input handlers are called with a single argument, the source file name. They should write the result to stdout.
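
The calling convention can be illustrated with a minimal sketch of a plain-text handler. The function names extract_text and handle are hypothetical, and the cat call is only a stand-in for whatever extractor suits the real format; this is not taken from an actual Recoll handler.

```shell
#!/bin/sh
# Minimal plain-text handler sketch (illustrative only).

# Hypothetical stand-in: replace `cat` with the real text extractor
# for your document format.
extract_text() {
    cat -- "$1"
}

# Check the single file-name argument, then write the text to stdout.
handle() {
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: handler <input-file>" >&2
        return 1
    fi
    extract_text "$1"
}

# Run only when a file name was actually passed on the command line.
if [ "$#" -gt 0 ]; then
    handle "$@"
fi
```

Saved under a name such as rclmyfmt (hypothetical) and made executable, it would be run as ./rclmyfmt /path/to/file, with the extracted text appearing on stdout and errors on stderr.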

When writing a handler, first decide whether it will output plain text or HTML. Plain text is simpler, but you will not be able to add metadata or vary the output character encoding (which is then defined in a configuration file). Some formatting may also be easier to preserve when previewing HTML. In practice the deciding factor is metadata: Recoll can extract metadata from the HTML header and use it for field searches.
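
As a rough sketch of the HTML option, the functions below wrap extracted text in a small HTML document whose header carries a title, a character encoding declaration, and an author. The names html_escape and emit_html are hypothetical, and the choice of the title element and an author meta tag as the metadata carriers is an assumption made for illustration; check an existing handler for the exact header layout Recoll expects.

```shell
#!/bin/sh
# HTML-output sketch (illustrative only).

# Escape the characters that are special in HTML text and attribute values.
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# emit_html TITLE AUTHOR FILE
# Writes an HTML document to stdout: metadata in the header (assumed
# layout), and the content of FILE, escaped, inside a <pre> element.
emit_html() {
    title=$(printf '%s' "$1" | html_escape)
    author=$(printf '%s' "$2" | html_escape)
    printf '<html><head>\n'
    printf '<title>%s</title>\n' "$title"
    printf '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
    printf '<meta name="author" content="%s">\n' "$author"
    printf '</head><body><pre>\n'
    html_escape < "$3"
    printf '</pre></body></html>\n'
}
```

A real handler would obtain the title and author from the document itself and then call emit_html "$title" "$author" "$1". Escaping matters here: unescaped <, > or & characters in the extracted text would otherwise produce malformed HTML.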

The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells the handler whether the operation is for indexing or previewing. Some handlers use this to output a slightly different format, for example stripping uninteresting repeated keywords (e.g. Subject: for email) when indexing. This is not essential.
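
A handler might test the variable along these lines. This is only a sketch: the function name filter_for_mode is hypothetical, it assumes that yes means previewing, and it treats an unset variable as indexing, which is an assumption rather than documented behavior.

```shell
#!/bin/sh
# Preview/index switch sketch (illustrative only).

# Reads text on stdin and writes it to stdout.
# Previewing: pass the text through unchanged.
# Indexing (assumed default when the variable is unset): drop the
# repeated "Subject: " label at the start of a line.
filter_for_mode() {
    if [ "${RECOLL_FILTER_FORPREVIEW:-no}" = "yes" ]; then
        cat
    else
        sed -e 's/^Subject: //'
    fi
}
```

In a handler this would sit at the end of the extraction pipeline, so that the same extractor output serves both modes.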

A good starting point is to look at one of the existing simple handlers, for example rclps.

Don't forget to make your handler executable before testing!