General features
-
Easy installation. No database daemon, web server, desktop environment or exotic language necessary.
-
Runs on most Unix-based systems, MS-Windows, and Mac OS X.
-
Qt GUI, WEB, command line, Gnome Shell Search Plugin, Unity Scope, Dolphin KIO and krunner interfaces.
-
Searches most common document types, emails and their attachments. Transparently handles decompression (zip, gzip, bzip2, etc.).
-
Powerful query facilities, with boolean searches, phrases, proximity, wildcards, filter on file types and directory tree, accessible through the query language or a GUI query builder interface.
-
Multi-language and multi-character set with Unicode based internals.
-
Extensive documentation, with a complete user manual and man pages for each command.
-
Can use a Firefox extension to index visited Web pages history. See the Howto for more detail.
-
Processes all email attachments, and more generally any realistic level of container nesting (the "msword attachment to a message inside a mailbox in a zip" thingy…).
-
Multiple selectable databases.
-
Support for multiple charsets. Internal processing and storage uses Unicode UTF-8.
-
Stemming performed at query time (can switch stemming language after indexing).
-
An indexer which can be set to run either as a batch, cron’able program, or, on Unix-like systems, as a real-time indexing daemon.
Document types
Recoll can index many document types (along with their compressed versions). Some types are processed internally (no external application needed). Other types need a separate application to be installed to extract the text. Types that only need very common utilities (awk/sed/groff/iconv, Python etc.) are listed in the native section.
The MS-Windows installer includes the supporting applications, so no additional installation is needed.
Many formats are processed by Python3 scripts. Formats which are processed using only the Python standard library are listed in the native section.
After installing a missing handler, you may need to tell recollindex to retry the failed files by adding option -k to the command line (this can also be done from the GUI File→Special indexing menu). This is because recollindex, in its default operation mode, will not retry files which caused an error during an earlier pass. In special cases, it may be useful to reset the data for a category of files before indexing. See the recollindex manual page. If your index is not too big, it may be simpler to just reset it.
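As a rough illustration of the retry step, the following Python sketch builds and optionally runs the recollindex -k command mentioned above. The function names are hypothetical and not part of Recoll; the only facts taken from the text are the program name and the -k option.

```python
import shutil
import subprocess


def build_retry_command(extra_args=()):
    """Return the argument list for a recollindex run that retries failed files.

    Only the program name and the -k option come from the documentation;
    extra_args is a hypothetical hook for any other options you may need.
    """
    return ["recollindex", "-k", *extra_args]


def retry_failed_files():
    """Run 'recollindex -k' if the program is found in PATH.

    Returns the exit code, or None when recollindex is not installed.
    """
    if shutil.which("recollindex") is None:
        return None
    return subprocess.run(build_retry_command(), check=False).returncode
```

Calling retry_failed_files() after installing a handler is then equivalent to typing recollindex -k in a terminal.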
File types indexed natively
The formats in this section are processed by a basic Recoll installation and its immediate dependencies, without the need to install further applications or modules.
-
text
-
html
-
maildir, mh, and mailbox (Mozilla, Thunderbird and Evolution mail ok). Evolution note: be sure to remove .cache from the skippedNames list in the GUI Indexing preferences/Local Parameters pane if you want to index local copies of IMAP mail. Outlook archives are processed with an external helper, see below.
-
gaim and purple log files.
-
Scribus files.
-
Man pages (needs groff).
-
Mimehtml web archive format (this is based on the mail handler, which introduces some mild weirdness, but is still usable).
Recoll processes most XML-based formats internally (using the libxml2 and libxslt C++ libraries).
-
OpenOffice files.
-
Microsoft Office Open XML files.
-
Abiword files.
-
Kword files.
-
Fb2 ebooks.
-
SVG files.
-
Gnumeric files.
-
Okular annotations files.
The following use Python3 and its standard library:
-
Excel and Powerpoint files (pre-open-xml).
-
Zip archives.
-
Joplin notes (see the details here).
-
Dia diagrams.
-
Tar archives (and their compressed versions). Tar file indexing is disabled by default, because tar archives don’t typically contain the kind of documents that people search for. You will need to enable it explicitly, e.g., with the following in your $HOME/.recoll/mimeconf file:
    [index]
    application/x-tar = execm rcltar.py
Replace rcltar.py with rcltar for old recoll versions. You can check which exists by looking in the handlers directory, e.g., /usr/share/recoll/filters/.
-
Konqueror webarchive format (uses the tarfile Python standard library module).
-
Mac OS webarchive format (not to be confused with the above). From Recoll 1.42.2, only on Mac OS (uses the Mac textutil program, which is installed by default, which is why we count the format as internal).
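For the tar archive entry above, picking between rcltar.py and rcltar can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the default directory is the example path given in the text, which may differ on your system.

```python
import os

# Handler names from the text, newest first. Everything else here
# (function names, defaults) is hypothetical and for illustration only.
TAR_HANDLERS = ("rcltar.py", "rcltar")


def choose_tar_handler(names):
    """Return the first known tar handler present in 'names', or None."""
    available = set(names)
    for handler in TAR_HANDLERS:
        if handler in available:
            return handler
    return None


def mimeconf_snippet(names):
    """Build the mimeconf lines that enable tar indexing, or None."""
    handler = choose_tar_handler(names)
    if handler is None:
        return None
    return "[index]\napplication/x-tar = execm " + handler + "\n"


def list_handlers(filters_dir="/usr/share/recoll/filters"):
    """List the handlers directory; return an empty list if it is missing."""
    try:
        return os.listdir(filters_dir)
    except OSError:
        return []
```

Printing mimeconf_snippet(list_handlers()) shows the lines to add to $HOME/.recoll/mimeconf, or None if no tar handler was found in the directory.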
File types indexed with external helpers
The following need miscellaneous helper programs or libraries to extract the document text.
-
PDF needs the pdftotext command, which comes with poppler. The package name is quite often poppler-utils. Note: the older pdftotext command which comes with xpdf is not compatible with Recoll. PDF has its own section further down, with details about metadata, attachments, annotations, OCR, and opening documents at the right page.
-
Microsoft Word is processed with antiword, which is not maintained much, but keeps working. I maintain a very slightly improved antiword version, which can extract a little extra data in some cases. It is also useful to have wvWare installed, as Recoll can use it as a fallback for some files which antiword does not handle.
-
RTF files with unrtf. Note that up to version 0.21.3, unrtf mostly does not work with non-Western-European character sets. Many serious problems (crashes with serious security implications and infinite loops) were fixed in unrtf 0.21.8, so you really want to use this or a newer release. Building unrtf from source is quick and easy, but most distributions now have an up-to-date unrtf.
-
CHM (Microsoft help) files with chmlib. Recoll bundles the Python3 bindings for the library (ported from pychm which originally did not support Python3).
-
EPUB files with Python and the epub module, which is not packaged on Debian. The packaged version by the original author (0.5.2) is old and suffers from a lot of bitrot, so Recoll now bundles an unpackaged version, updated by Arthur Darcet: no need to install anything.
-
Audio tags: Recoll uses a Python script based on the mutagen package, which you need to install.
-
Image tags are extracted with perl and exiftool.
-
Microsoft Outlook .pst and .ost files are processed with libpff. We use a slightly modified version (to provide streaming output), stored in this repository. This is bundled with Recoll for Windows. You will need to build it on other platforms and ensure that the pffexport program is in the executable PATH when recoll/recollindex is run.
-
Hancom office Hanword .hwp format for Korean text processing, using the pyhwp Python module. See the module page. Use pip3 install pyhwp to install on Linux. This is bundled with Recoll for Windows. On Debian, you also probably want to install the fonts-nanum package, which is not part of the default install.
-
Wordperfect is processed with the wpd2html command from the libwpd package. On some distributions, the command may come with a package named libwpd-tools or such, not the base libwpd package.
-
djvu with DjVuLibre.
-
GNU info files are processed with Python and the info command.
-
Lyx files need Lyx to be installed.
-
Rar archives with either the unrar or rarfile Python package. rarfile is a wrapper over the unrar command. unrar uses the libunrar library and works much better. The library can sometimes be found as a standard package (Ubuntu libunrar5) or be built from the unrar source code. The rarfile wrapper is packaged as python3-rarfile by Debian. unrar is generally not packaged; use pip3 install unrar to install it. Note that the free version of unrar (unrar-free) fails for many files with the message "Failed the read enough data".
-
7zip archives with py7zr. The Recoll handler can alternatively use pylzma, but this fails on some archives. Neither is packaged by all distributions, and you may have to use pip3 for installation.
-
iCalendar (.ics) files with the icalendar module.
-
Mozilla calendar data. See the Howto about this.
-
Postscript with the ps2pdf command from ghostscript, and pdftotext from poppler.
-
TeX with untex. If there is no untex package for your distribution, this site stores a source package, as untex has no obvious home. Will also work with detex if this is installed.
-
DVI with catdvi.
-
Midi karaoke files need the 'six' and 'chardet' Python3 modules.
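To see quickly which of the helper commands named in the list above are missing from your PATH, something like the following Python sketch can be used. The mapping only covers commands that the list names explicitly, it is not a complete inventory of what Recoll uses, and the function name is hypothetical.

```python
import shutil

# Format -> helper commands, taken from the list above. This is a partial,
# illustrative mapping; Python modules (mutagen, py7zr, ...) are not checked.
HELPERS = {
    "PDF": ["pdftotext"],
    "Microsoft Word": ["antiword"],
    "RTF": ["unrtf"],
    "Image tags": ["exiftool"],
    "Outlook .pst/.ost": ["pffexport"],
    "Wordperfect": ["wpd2html"],
    "Postscript": ["ps2pdf", "pdftotext"],
    "TeX": ["untex"],  # detex also works according to the text
    "DVI": ["catdvi"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return {format: [commands not found in PATH]} for incomplete formats.

    'which' is injectable so that the lookup can be replaced in tests.
    """
    missing = {}
    for fmt, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[fmt] = absent
    return missing
```

An empty result from missing_helpers() means that every command in the mapping was found. It does not tell you whether the installed versions are recent enough (see the unrtf note above).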
Desktop and WEB integration
The Recoll GUI has many features that help to specify an efficient search and to manage the results. However, it may sometimes be preferable to use a simpler tool which is better integrated with your desktop interfaces. Several solutions exist:
-
The Recoll Web UI lets you query a Recoll index from a WEB browser. The one linked here, from framagit.org, is a bit more up to date than the one on GitHub (by GitHub user koniu), and is the one to use.
-
The Recoll Gnome Shell Search Provider allows searching from the Gnome Shell.
-
The Recoll KIO module allows starting queries and viewing results from the Dolphin file manager, the Konqueror browser or other KDE applications' Open dialogs.
-
The recollrunner krunner module allows integrating Recoll search results into a krunner query.
-
The Ubuntu Unity Recoll Scope (or Lens for older Unity versions) lets you access Recoll search from the Unity Dash. More slightly obsolete information here.
Recoll also has Python and PHP extension modules which allow easy integration with WEB or other applications.
Stemming
Stemming is a process which transforms inflected words into their most basic form. For example, flooring, floors, floored would probably all be transformed to floor by a stemmer for the English language.
In many search engines, the stemming process occurs during indexing. The index will only contain the stemmed form of words, with exceptions for terms which are detected as being probably proper nouns (ie: capitalized). At query time, the terms entered by the user are stemmed, then matched against the index.
This process results in a smaller index, but it has the serious drawback of irrevocably losing information during indexing.
Recoll works in a different way.
-
No destructive stemming is performed during indexing, so that all information gets into the index. The resulting index is bigger, but most people probably don’t care much about this nowadays, because they have a 100Gb disk 95% full of binary data which does not get indexed.
-
At the end of an indexing pass, Recoll builds one or several stemming dictionaries, where all word stems are listed in correspondence to the list of their derivatives.
-
At query time, by default, user-entered terms are stemmed, then matched against the stem database, and the query is expanded to include all derivatives. This will yield search results analogous to those obtained by a classical engine.
The benefit of this approach is that stem expansion can be controlled instantly at query time in several ways:
-
It can be selectively turned off for any query term by capitalizing it (Floor).
-
The stemming language (ie: english, french…) can be selected (this supposes that several stemming databases have been built, which can be configured as part of the indexing, or done later, in a reasonably fast way).
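The query-time approach described above can be illustrated with a toy Python sketch. This is not Recoll's implementation: the suffix-stripping "stemmer" is a deliberately naive stand-in, all names are hypothetical, and lowercasing the capitalized term is an assumption of this example.

```python
from collections import defaultdict

# Toy suffix list: a stand-in for a real stemmer, for illustration only.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one known suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def build_stem_db(index_terms):
    """Map each stem to the set of unstemmed terms found in the index."""
    db = defaultdict(set)
    for term in index_terms:
        db[toy_stem(term)].add(term)
    return db


def expand_term(term, stem_db):
    """Expand a query term to all its derivatives, unless it is capitalized."""
    if term[:1].isupper():
        # Capitalized: no stem expansion, match the term itself only.
        return {term.lower()}
    return stem_db.get(toy_stem(term), set()) | {term}


# The index keeps every word form; the stem database is built afterwards.
stem_db = build_stem_db(["floor", "floors", "floored", "flooring"])
```

With this data, expand_term("floors", stem_db) yields all four forms, while expand_term("Floor", stem_db) yields only floor. Because the index itself is never stemmed, switching to another stemming language only means building another stem database, not reindexing.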
Special features for PDF
Recoll needs the pdftk command to index PDF attachments. The package is commonly found in the system repository for Linux distributions. This is not available with the Windows Recoll package at the moment.
Recoll can extract and process custom XMP metadata fields, mapping them to query fields. See the manual. This needs the pdfinfo command, which usually comes in the same package as pdftotext. This is included in the Windows Recoll package.
Recoll can extract and index PDF annotations by using the poppler-glib Python bindings. These are normally available with the Poppler installation on Linux systems, by adding the Poppler GObject introspection data, which can come in variously named packages, for example gir1.2-poppler-0.18 on Ubuntu, typelib-1_0-Poppler-0_18 on OpenSuse. The actual versions may differ of course. This is not available on Windows at the moment.
The default configuration uses the evince application to open PDF files. evince has options for direct page access and pre-setting the search strings (hits will be highlighted). There are examples in the default mimeview for doing the same thing with qpdfview (qpdfview --search %s %f#%p). Okular does not have a search string option (but it does have a page number one). Evince is available for Windows, but you will need to install it separately.
Recoll can automatically perform OCR on pure image PDF files, and cache the results so that OCR is not needed for future indexing passes. The function can currently use either tesseract or ABBYY FineReader. See the manual. NOTE: using OCRmyPDF is probably a much better approach in most cases.