Document types

Recoll knows about quite a few different document types. The parameters for document type recognition and processing are set in configuration files.

Most file types, like HTML or word processing files, only hold one document. Some file types, like email folders or zip archives, can hold many individually indexed documents, which may themselves be compound ones. Such hierarchies can go quite deep, and Recoll can process, for example, a LibreOffice document stored as an attachment to an email message inside an email folder archived in a zip file...

recollindex processes plain text, HTML, OpenDocument (Open/LibreOffice), email formats, and a few others internally.

Other file types (e.g. PostScript, PDF, MS Word, RTF ...) need external applications for preprocessing. The list is in the installation section. After every indexing operation, Recoll updates a list of the commands that would be needed to index the existing file types. This list can be displayed by selecting the menu option File → Show Missing Helpers in the recoll GUI. It is stored in the missing text file inside the configuration directory.

After installing a missing handler, you may need to tell recollindex to retry the failed files, either by adding option -k to the command line or by using the GUI File → Special indexing menu. This is because recollindex, in its default operation mode, will not retry files which caused an error during an earlier pass. In special cases, it may be useful to reset the data for a category of files before indexing. See the recollindex manual page. If your index is not too big, it may be simpler to just reset it.

By default, Recoll will try to index any file type that it has a way to read. This is sometimes not desirable, and there are ways to either exclude some types or, conversely, define an exclusive list of types to be indexed. In the latter case, any type not in the list will be ignored.

Files can be excluded by name by adding wildcard name patterns to the skippedNames list, which can be edited from the GUI Index configuration menu. Files can be excluded by type by setting the excludedmimetypes list in the configuration file. This list can be redefined for subdirectories.

You can also define an exclusive list of MIME types to be indexed (no others will be indexed), by setting the indexedmimetypes configuration variable. Example:

indexedmimetypes = text/html application/pdf

It is possible to redefine this parameter for subdirectories. Example:

[/path/to/my/dir]
indexedmimetypes = application/pdf

(When using sections like this, don't forget that they remain in effect until the end of the file or until another section indicator.)

Both excludedmimetypes and indexedmimetypes can be set either by editing the configuration file (recoll.conf) for the index, or by using the GUI index configuration tool.
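
The combined effect of the two lists can be sketched in a few lines of Python. This is only an illustration of the rules described above, not Recoll's own code: the function names are hypothetical, the values are assumed to be space-separated as in the examples, and the sketch assumes that a type appearing in both lists is excluded.

```python
def parse_type_list(value):
    """Split a space-separated configuration value into a set of MIME types."""
    return set(value.split())


def should_index(mime_type, indexed="", excluded=""):
    """Return True if mime_type passes both lists (hypothetical helper).

    indexed  -- value of indexedmimetypes ("" means: no exclusive list)
    excluded -- value of excludedmimetypes ("" means: nothing excluded)
    """
    indexed_types = parse_type_list(indexed)
    excluded_types = parse_type_list(excluded)
    # Assumption: an excluded type is skipped even if it is also listed
    # in the exclusive list.
    if mime_type in excluded_types:
        return False
    # A non-empty exclusive list means that any type not in it is ignored.
    if indexed_types and mime_type not in indexed_types:
        return False
    return True


# With the exclusive list from the first example, PDF files are indexed
# and JPEG images are ignored.
print(should_index("application/pdf", indexed="text/html application/pdf"))
print(should_index("image/jpeg", indexed="text/html application/pdf"))
```

With both lists empty, the function accepts every type, which matches the default behaviour of indexing anything that can be read.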

Note about MIME types

When editing the indexedmimetypes or excludedmimetypes lists, you should use the MIME values listed in the mimemap file or in Recoll result lists in preference to file -i output: there are a number of differences. The file -i output should only be used for files without extensions, or for which the extension is not listed in mimemap.
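
As a rough aid for finding the value to use, the following Python sketch looks up a file name extension in mimemap-style text. It assumes that the relevant entries have the form ".ext = mime/type"; the sample data and the function names are hypothetical, and the real mimemap file contains other kinds of lines, which the sketch simply skips.

```python
import os

# Hypothetical sample in the assumed ".ext = mime/type" form.
MIMEMAP_SAMPLE = """\
# extension = MIME type
.html = text/html
.pdf = application/pdf
"""


def load_mimemap(text):
    """Build an extension -> MIME type dictionary from mimemap-style text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and section indicators.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        key = key.strip().lower()
        # Keep only extension entries; other variables are ignored.
        if sep and key.startswith("."):
            mapping[key] = value.strip()
    return mapping


def mime_from_name(filename, mapping):
    """Return the mapped MIME type, or None when file -i would be needed."""
    ext = os.path.splitext(filename)[1].lower()
    return mapping.get(ext)


mapping = load_mimemap(MIMEMAP_SAMPLE)
print(mime_from_name("report.PDF", mapping))
print(mime_from_name("README", mapping))
```

A None result corresponds to the two cases mentioned above: the file has no extension, or its extension is not listed.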