Recoll knows about quite a few different document types. The parameters for document type recognition and processing are set in configuration files.
Most file types, like HTML or word processing files, only hold one document. Some file types, like email folders or zip archives, can hold many individually indexed documents, which may themselves be compound ones. Such hierarchies can go quite deep, and Recoll can process, for example, a LibreOffice document stored as an attachment to an email message inside an email folder archived in a zip file...
recollindex processes plain text, HTML, OpenDocument (Open/LibreOffice), email formats, and a few others internally.
Other file types (e.g. PostScript, PDF, MS Word, RTF) need external applications for preprocessing. The list is in the installation section. After every indexing operation, Recoll updates a list of commands that would be needed for indexing existing file types. This list can be displayed by selecting the menu option → in the recoll GUI. It is stored in the missing text file inside the configuration directory.
After installing a missing handler, you may need to tell recollindex to retry the failed files, by adding the -k option to the command line, or by using the GUI → menu. This is because recollindex, in its default operation mode, will not retry files that caused an error during an earlier pass. In special cases, it may be useful to reset the data for a category of files before indexing. See the recollindex manual page. If your index is not too big, it may be simpler to just reset it.
By default, Recoll will try to index any file type that it has a way to read. This is sometimes not desirable, and there are ways to either exclude some types, or on the contrary define a positive list of types to be indexed. In the latter case, any type not in the list will be ignored. A detailed description of the parameters involved can be found in the document selection section of this manual.
For example, to define an exclusive list of MIME types to be indexed, you would set the indexedmimetypes configuration variable:
indexedmimetypes = text/html application/pdf
It is possible to redefine a parameter for subdirectories. Example:
[/path/to/my/dir]
indexedmimetypes = application/pdf
When using sections like this, don't forget that they remain in effect until the end of the file or another section indicator.
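The scope of a section can be illustrated with a short fragment. This is only a sketch: the second directory path and the text/plain value are hypothetical, chosen to show where one section ends and the next begins.

```ini
# Global part: applies to the whole index unless a section overrides it
indexedmimetypes = text/html application/pdf

[/path/to/my/dir]
# Applies only under /path/to/my/dir
indexedmimetypes = application/pdf

[/path/to/another/dir]
# Hypothetical second section: this indicator ends the previous section,
# and everything below it applies under /path/to/another/dir
indexedmimetypes = text/plain
```

Any assignment placed after a section indicator belongs to that section, so values meant to be global are best kept at the top of the file.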
As another example, excluding files by name can be done by adding wildcard name patterns to the skippedNames list. Excluding by type can be done by setting the excludedmimetypes value.
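A minimal sketch of both kinds of exclusion follows. The patterns and MIME types are illustrative assumptions, not recommended values, and assigning skippedNames directly in this way may replace the default pattern list rather than extend it, so check the configuration section of the manual before copying it.

```ini
# Hypothetical name patterns: skip backup and temporary files by name
skippedNames = *.bak *.tmp

# Hypothetical types: do not index these MIME types
excludedmimetypes = image/jpeg video/mp4
```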
Most parameters can be set either by editing the configuration file (recoll.conf) for the index, or by using the GUI Index configuration menu.
Note about MIME types
When editing the indexedmimetypes or excludedmimetypes lists, you should use the MIME values listed in the mimemap file or in Recoll result lists rather than the output of file -i, as there are a number of differences between them. The system command output should only be used for files without extensions, or for which the extension is not listed in mimemap.

