Recoll knows about quite a few different document types. The parameters for document type recognition and processing are set in configuration files.
Most file types, like HTML or word processing files, only hold one document. Some file types, like email folders or zip archives, can hold many individually indexed documents, which may themselves be compound ones. Such hierarchies can go quite deep, and Recoll can process, for example, a LibreOffice document stored as an attachment to an email message inside an email folder archived in a zip file...
recollindex processes plain text, HTML, OpenDocument (Open/LibreOffice), email formats, and a few others internally.
Other file types (e.g. PostScript, PDF, MS Word, RTF) need external applications for preprocessing. The list is in the installation section. After every indexing operation, Recoll updates a list of the commands that would be needed for indexing the file types it found but could not process. This list can be displayed from the recoll GUI menus. It is stored in the missing text file inside the configuration directory.
After installing a missing handler, you may need to tell recollindex to retry the failed files, either by adding option -k to the command line or by using the corresponding GUI menu entry. This is because recollindex, in its default operation mode, will not retry files which caused an error during an earlier pass. In special cases, it may be useful to reset the data for a category of files before indexing. See the recollindex manual page. If your index is not too big, it may be simpler to just reset it.
By default, Recoll will try to index any file type that it has a way to read. This is sometimes not desirable, and there are ways to either exclude some types, or on the contrary define a positive list of types to be indexed. In the latter case, any type not in the list will be ignored.
Excluding files by name can be done by adding wildcard name patterns to the skippedNames list, which can be done from the GUI Index configuration menu. Excluding by type can be done by setting the excludedmimetypes list in the configuration file. This can be redefined for subdirectories.
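As an illustration of excluding by type, a minimal sketch of what the setting could look like in the configuration file follows. The MIME values and the directory path are hypothetical placeholders chosen for the example; check the exact values against the mimemap file before using them.

```
# Assumed example values: skip common image types everywhere
excludedmimetypes = image/jpeg image/png

# Hypothetical subdirectory: redefine the list for this tree only
[/path/to/my/dir]
excludedmimetypes = image/jpeg
```

Keeping the global value above the first section header avoids having it accidentally apply to a single subdirectory only.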
You can also define an exclusive list of MIME types to be indexed (no others will be indexed), by setting the indexedmimetypes configuration variable. Example:
indexedmimetypes = text/html application/pdf
It is possible to redefine this parameter for subdirectories. Example:
[/path/to/my/dir]
indexedmimetypes = application/pdf
(When using sections like this, don't forget that they remain in effect until the end of the file or another section indicator).
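The scope rule for sections can be sketched as follows. This is only an illustrative layout: the second directory path is hypothetical, and the values are examples rather than recommendations.

```
# Global value: applies wherever no section redefines it
indexedmimetypes = text/html application/pdf

[/path/to/my/dir]
# In effect from here until the next section header
indexedmimetypes = application/pdf

[/path/to/other/dir]
# The previous section ends here; anything below belongs to this one
indexedmimetypes = text/html
```

A setting meant to be global but written below a section header would silently apply to that section only, which is why global values are best placed at the top of the file.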
excludedmimetypes or indexedmimetypes can be set either by editing the configuration file (recoll.conf) for the index, or by using the GUI index configuration tool.
Note about MIME types
When editing the indexedmimetypes or excludedmimetypes lists, you should use the MIME values listed in the mimemap file or in Recoll result lists in preference to file -i output: there are a number of differences. The file -i output should only be used for files without extensions, or for which the extension is not listed in mimemap.