General features
-
Easy installation. No database daemon, web server, desktop environment or exotic language necessary.
-
Runs on most Unix-based systems, MS-Windows, and Mac OS X.
-
Qt GUI, WEB, command line, Gnome Shell Search Plugin, Unity Scope, Dolphin KIO and krunner interfaces.
-
Searches most common document types, emails and their attachments. Transparently handles decompression (zip, gzip, bzip2, etc.).
-
Powerful query facilities, with boolean searches, phrases, proximity, wildcards, and filtering on file types and directory tree, accessible through the query language or a GUI query builder interface.
-
Multi-language and multi-character set with Unicode based internals.
-
Extensive documentation, with a complete user manual and man pages for each command.
-
Can use a Firefox extension to index the history of visited Web pages. See the Howto for more detail.
-
Processes all email attachments and, more generally, any realistic level of container nesting (for example, an MS Word attachment to a message inside a mailbox inside a zip archive).
-
Multiple selectable databases.
-
Support for multiple charsets. Internal processing and storage use Unicode UTF-8.
-
Stemming performed at query time (can switch stemming language after indexing).
-
An indexer which can be set to run either as a batch, cron’able program, or, on Unix-like systems, as a real-time indexing daemon.
Document types
Recoll can index many document types (along with their compressed versions). Some types are processed internally (no external application needed). Other types need a separate application to be installed to extract the text. Types that only need very common utilities (awk/sed/groff/iconv, Python etc.) are listed in the native section.
The MS-Windows installer includes the supporting applications; no additional installation is needed.
Many formats are processed by Python scripts. The Python dependency will not always be mentioned. In general, Recoll up to 1.24 expects Python 2.x to be available. Recoll 1.25 and later use Python3 (most scripts are actually compatible with both versions). Formats which are processed using Python and its standard library only are listed in the native section.
After installing a missing handler, you may need to tell recollindex to retry the failed files, by adding option -k to the command line (this can also be done from the GUI File→Special indexing menu). This is because recollindex in its default operation mode will not retry files which caused an error during an earlier pass. In special cases, it may be useful to reset the data for a category of files before indexing. See the recollindex manual page. If your index is not too big, it may be simpler to just reset it.
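As a minimal sketch of the retry step, the following Python helper builds the recollindex command line with the -k option mentioned above and runs it only if recollindex is found on the PATH. The function names are illustrative and are not part of Recoll.

```python
import shutil
import subprocess


def build_retry_command(program="recollindex"):
    """Return the argument list for an indexing pass that retries failed files."""
    # -k is the option named in the text above: retry files which failed earlier.
    return [program, "-k"]


def retry_failed_files():
    """Run the retry pass if recollindex is installed; return its exit code or None."""
    cmd = build_retry_command()
    if shutil.which(cmd[0]) is None:
        # recollindex is not on the PATH: nothing to run.
        return None
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    print(retry_failed_files())
```

Running the indexer by hand with the same option has the same effect; the wrapper is only useful when the retry is triggered from a script, for example right after installing a handler package.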
File types indexed natively
-
text
-
html
-
maildir, mh, and mailbox (Mozilla, Thunderbird and Evolution mail ok). Evolution note: be sure to remove .cache from the skippedNames list in the GUI Indexing preferences/Local Parameters pane if you want to index local copies of Imap mail. Outlook archives are processed with an external helper, see below.
-
gaim and purple log files.
-
Scribus files.
-
Man pages (needs groff).
-
Mimehtml web archive format (this is based on the mail handler, which introduces some mild weirdness, but is still usable).
All the following need Python3:
-
Excel and Powerpoint files (pre-open-xml).
-
Zip archives.
-
Joplin notes (see the details here).
-
Dia diagrams.
-
Tar archives (and their compressed versions). Tar file indexing is disabled by default (because tar archives don’t typically contain the kind of documents that people search for). You will need to enable it explicitly, e.g., with the following in your $HOME/.recoll/mimeconf file:
[index]
application/x-tar = execm rcltar.py
Replace rcltar.py with rcltar for old recoll versions. You can check which exists by looking in the handlers directory, e.g., /usr/share/recoll/filters/.
-
Konqueror webarchive format (uses the tarfile Python standard library module).
-
Midi karaoke files need the 'six' and 'chardet' Python3 modules.
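For the tar entry in the list above, the check for which handler name exists can be sketched in Python as follows. The directory default is the example path quoted in that entry and may differ on your system; the function names are hypothetical and only serve the illustration.

```python
import os


def tar_handler_name(handlers_dir="/usr/share/recoll/filters/"):
    """Return 'rcltar.py' or 'rcltar', whichever exists in handlers_dir, else None."""
    # Prefer the newer name, fall back to the one used by old recoll versions.
    for name in ("rcltar.py", "rcltar"):
        if os.path.isfile(os.path.join(handlers_dir, name)):
            return name
    return None


def tar_mimeconf_fragment(handler):
    """Build the mimeconf lines that enable tar indexing with the given handler."""
    return "[index]\napplication/x-tar = execm " + handler + "\n"


if __name__ == "__main__":
    handler = tar_handler_name()
    if handler is None:
        print("No tar handler found in the handlers directory")
    else:
        # Print the lines to copy into $HOME/.recoll/mimeconf.
        print(tar_mimeconf_fragment(handler), end="")
```

The script only prints the fragment and leaves the editing of mimeconf to you, so an existing [index] section is never overwritten.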
File types indexed with external helpers
The XML ones
Recoll 1.26 and later process XML internally, by using the libxml2 and libxslt C libraries. Quite a few formats also need the unzip command.
Recoll 1.25 used python3-lxml. Versions from 1.22 to 1.24 used python-libxslt and python-libxml2. Versions older than 1.22 needed the xsltproc command.
-
OpenOffice files.
-
Microsoft Office Open XML files.
-
Abiword files.
-
Kword files.
-
Fb2 ebooks.
-
SVG files.
-
Gnumeric files.
-
Okular annotations files.
Other formats
The following need miscellaneous helper programs to extract the document text.
-
PDF needs the pdftotext command, which comes with poppler. The package name is quite often poppler-utils. Note: the older pdftotext command which comes with xpdf is not compatible with Recoll. PDF has its own section below, with details about metadata, attachments, annotations, OCR, and opening documents at the right page.
-
Microsoft Word is processed with antiword, which is not maintained much, but keeps working. I maintain a very slightly improved antiword version, which can extract a little extra data in some cases. It is also useful to have wvWare installed, as Recoll can use it as a fallback for some files which antiword does not handle.
-
RTF files with unrtf. Note that up to version 0.21.3, unrtf mostly does not work with non-Western-European character sets. Many serious problems (crashes with serious security implications and infinite loops) were fixed in unrtf 0.21.8, so you really want to use this or a newer release. Building unrtf from source is quick and easy.
-
CHM (Microsoft help) files with Python pychm and chmlib. Recoll 1.25 and later bundle a Python3 version of the CHM package (this is necessary because the original package was not ported to Python3).
-
EPUB files with Python and the epub module, which is packaged on Fedora, but not Debian. The packaged version by the original author (0.5.2) is old and suffers from a lot of bitrot, so Recoll now bundles an unpackaged version, updated by Arthur Darcet.
-
Microsoft Outlook .pst and .ost files are processed with libpff. We use a slightly modified version (to provide streaming output), stored in this repository. This is bundled with Recoll for Windows. You will need to build it on other platforms and ensure that the pffexport program is in the executable PATH when recoll/recollindex is run.
-
Hancom office Hanword .hwp format for Korean text processing, using the pyhwp Python module. See the module page. Use pip3 install pyhwp to install on Linux. This will be bundled with future Recoll for Windows versions (1.26.6 and later). On Debian, you also probably want to install the fonts-nanum package, which is not part of the default install.
-
Wordperfect is processed with the wpd2html command from the libwpd package. On some distributions, the command may come with a package named libwpd-tools or similar, not the base libwpd package.
-
djvu with DjVuLibre.
-
Audio: Recoll releases 1.14 and later use a Python script based on the mutagen package to extract tags for all audio types.
-
Image tags are extracted with perl and exiftool.
-
GNU info files are processed with Python and the info command.
-
Lyx files need Lyx to be installed.
-
Rar archives with either the unrar or rarfile Python package. rarfile is a wrapper over the unrar command. unrar uses the libunrar library and works much better. The library can sometimes be found as a standard package (Ubuntu libunrar5) or be built from the unrar source code. The rarfile wrapper is packaged as python3-rarfile by both Fedora and Debian. The unrar Python package is generally not packaged; use pip3 install unrar to install it. Note that the free version of unrar (unrar-free) fails for many files with the message "Failed the read enough data".
-
7zip archives with py7zr. The Recoll handler can alternatively use pylzma, but this fails on some archives. Neither is packaged by all distributions, and you may have to use pip3 for installation.
-
iCalendar (.ics) files with the icalendar module.
-
Mozilla calendar data. See the Howto about this.
-
Postscript with the ps2pdf command from ghostscript, and pdftotext from poppler.
-
TeX with untex. If there is no untex package for your distribution, this site stores a source package, as untex has no obvious home. Will also work with detex if this is installed.
-
DVI with catdvi.
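To see which of the helper commands named in the list above are installed, a small Python check along the following lines may help. It is only a sketch: the table covers a subset of the formats, the grouping is an assumption made for the example, and Recoll itself does not ship this script.

```python
import shutil

# Helper commands named in the list above, keyed by the format they serve.
HELPERS = {
    "PDF": ["pdftotext"],
    "Microsoft Word": ["antiword"],
    "RTF": ["unrtf"],
    "Outlook .pst/.ost": ["pffexport"],
    "Wordperfect": ["wpd2html"],
    "Image tags": ["exiftool"],
    "GNU info": ["info"],
    "Postscript": ["ps2pdf", "pdftotext"],
    "DVI": ["catdvi"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Map each format to the list of its helper commands not found on the PATH."""
    report = {}
    for fmt, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            report[fmt] = absent
    return report


if __name__ == "__main__":
    for fmt, commands in missing_helpers().items():
        print(fmt + ": missing " + ", ".join(commands))
```

The check only looks at the PATH. It cannot tell a poppler pdftotext from the incompatible xpdf one, and it does not cover the handlers that depend on Python modules.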
Supported systems
Recoll has been built and tested on Linux, MS-Windows 7-10, MacOS X and Solaris (initial versions Redhat 7, Fedora Core 5, Suse 10, Gentoo, Debian 3.1, Solaris 8). It should compile and run on all subsequent releases of these systems and probably a few others too.
Desktop and WEB integration
The Recoll GUI has many features that help to specify an efficient search and to manage the results. However, it may sometimes be preferable to use a simpler tool with better integration into your desktop interfaces. Several solutions exist:
-
The Recoll Web UI lets you query a Recoll index from a WEB browser. The one linked here, from framagit.org, is a bit more up to date than the one on GitHub (by GitHub user koniu), and is the one to use.
-
The Recoll Gnome Shell Search Provider allows searching from the Gnome Shell.
-
The Recoll KIO module allows starting queries and viewing results from the Dolphin file manager, the Konqueror browser or other KDE applications' Open dialogs.
-
The recollrunner krunner module allows integrating Recoll search results into a krunner query.
-
The Ubuntu Unity Recoll Scope (or Lens for older Unity versions) lets you access Recoll search from the Unity Dash. More slightly obsolete information here.
Recoll also has Python and PHP extension modules which allow easy integration with WEB or other applications.
Stemming
Stemming is a process which transforms inflected words into their most basic form. For example, flooring, floors, floored would probably all be transformed to floor by a stemmer for the English language.
In many search engines, the stemming process occurs during indexing. The index will only contain the stemmed form of words, with exceptions for terms which are detected as being probably proper nouns (i.e. capitalized). At query time, the terms entered by the user are stemmed, then matched against the index.
This process results in a smaller index, but it has the serious drawback of irrevocably losing information during indexing.
Recoll works in a different way.
-
No destructive stemming is performed during indexing, so that all information gets into the index. The resulting index is bigger, but most people probably don’t care much about this nowadays, because they have a 100Gb disk 95% full of binary data which does not get indexed.
-
At the end of an indexing pass, Recoll builds one or several stemming dictionaries, where all word stems are listed in correspondence to the list of their derivatives.
-
At query time, by default, user-entered terms are stemmed, then matched against the stem database, and the query is expanded to include all derivatives. This will yield search results analogous to those obtained by a classical engine.
The benefit of this approach is that stem expansion can be controlled instantly at query time in several ways:
-
It can be selectively turned off for any query term by capitalizing it (Floor).
-
The stemming language (e.g. English, French…) can be selected (this supposes that several stemming databases have been built, which can be configured as part of the indexing, or done later, in a reasonably fast way).
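The query-time expansion described in this section can be illustrated with a toy Python model. This is a sketch under loud assumptions: toy_stem is a crude suffix stripper standing in for a real stemmer, the stem dictionary is a plain Python dict, and none of these names come from Recoll.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a few English suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[:-len(suffix)]
    return w


def build_stem_db(index_terms, stem=toy_stem):
    """Map each stem to the set of unstemmed index terms that derive from it."""
    stem_db = {}
    for term in index_terms:
        stem_db.setdefault(stem(term), set()).add(term.lower())
    return stem_db


def expand_term(term, stem_db, stem=toy_stem):
    """Expand a user term to all its derivatives, unless it is capitalized."""
    if term[:1].isupper():
        # Capitalized term: stem expansion is turned off for this term.
        return {term.lower()}
    return {term.lower()} | stem_db.get(stem(term), set())


if __name__ == "__main__":
    db = build_stem_db(["floor", "floors", "floored", "flooring", "door"])
    print(sorted(expand_term("floors", db)))
    print(sorted(expand_term("Floor", db)))
```

Because the index keeps the unstemmed terms, the stem dictionary can be rebuilt with another stemmer without reindexing the documents, which is what makes switching the stemming language after indexing possible.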
Special features for PDF
Recoll needs the pdftk command to index PDF attachments. The package is commonly found in the system repository for Linux distributions. This is not available with the Windows Recoll package at the moment.
Recoll can extract and process custom XMP metadata fields, mapping them to query fields. See the manual. This needs the pdfinfo command, which usually comes in the same package as pdftotext. This is included in the Windows Recoll package.
Recoll can extract and index PDF annotations by using the poppler-glib Python bindings. These are normally available with the Poppler installation on Linux systems, by adding the Poppler GObject introspection data, which can come in variously named packages, for example gir1.2-poppler-0.18 on Ubuntu, typelib-1_0-Poppler-0_18 on OpenSuse. The actual versions may differ of course. This is not available on Windows at the moment.
The default configuration uses the evince application to open PDF files. evince has options for direct page access and pre-setting the search strings (hits will be highlighted). There are examples in the default mimeview for doing the same thing with qpdfview (qpdfview --search %s %f#%p). Okular does not have a search string option (but it does have a page number one). Evince is available for Windows, but you will need to install it separately.
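To make the qpdfview example concrete, here is a hedged Python sketch of how such a viewer command template could be expanded. It assumes that %s stands for the search string, %f for the file path and %p for the page number. The function is hypothetical and is not Recoll's own substitution code.

```python
import re
import shlex


def expand_viewer_command(template, search, path, page):
    """Split a viewer command template and substitute %s, %f and %p in each argument."""
    values = {"%s": search, "%f": path, "%p": str(page)}
    # Split first, substitute afterwards, so that spaces in the file path
    # or search string stay inside a single argument.
    args = shlex.split(template)
    return [re.sub(r"%[sfp]", lambda m: values[m.group(0)], arg) for arg in args]


if __name__ == "__main__":
    cmd = expand_viewer_command(
        "qpdfview --search %s %f#%p", "floor", "/tmp/my doc.pdf", 3
    )
    print(cmd)
```

The resulting list can be handed to subprocess.run without a shell, which avoids quoting problems with unusual file names.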
Recoll can automatically perform OCR on pure image PDF files, and cache the results so that OCR is not needed for future indexing passes. The function can currently use either tesseract or ABBYY FineReader. See the manual. NOTE: using OCRmyPDF is probably a much better approach in most cases.