Recoll overview

Recoll uses the Xapian information retrieval library as its storage and retrieval engine. Xapian is a very mature package using a sophisticated probabilistic ranking model.

The Xapian library manages an index database which describes where terms appear in your document files. It efficiently processes the complex queries which are produced by the Recoll query expansion mechanism, and is in charge of the all-important relevance computation task.

Recoll provides the mechanisms and interface to get data into and out of the index. This includes translating the many possible document formats into pure text, handling term variations (using Xapian stemmers), and spelling approximations (using the aspell speller), interpreting user queries and presenting results.

In a shorter way, Recoll does the dirty footwork, Xapian deals with the intelligent parts of the process.

The Xapian index can be big (roughly the size of the original document set), but it is not a document archive. Recoll can only fully display documents that still exist at the place from which they were indexed. However, recent Recoll version do store the plain text from all indexed documents.

Recoll stores all internal data in Unicode UTF-8 format, and it can index many types of files with different character sets, encodings, and languages into the same index. It can process documents embedded inside other documents (for example a PDF document stored inside a Zip archive sent as an email attachment...), down to an arbitrary depth.

By default. Recoll processes east asian texts by generating terms as arbitrary sequences of consecutive characters (n-grams). However, it has provisions to integrate with language-aware text segmenters for Chinese and Korean which will produce a smaller index and improved search.

Stemming is the process by which Recoll reduces words to their radicals so that searching does not depend, for example, on a word being singular or plural (floor, floors), or on a verb tense (flooring, floored). Because the mechanisms used for stemming depend on the specific grammatical rules for each language, there is a separate Xapian stemmer module for most common languages where stemming makes sense.

Recoll stores the unstemmed versions of terms in the main index and uses auxiliary databases for term expansion (one for each stemming language), which means that you can switch stemming languages between searches, or add a language without needing a full reindex.

Storing documents written in different languages in the same index is possible, and commonly done. In this situation, you can specify several stemming languages for the index.

Recoll currently makes no attempt at automatic language recognition, which means that the stemmer will sometimes be applied to terms from other languages with potentially strange results. In practise, even if this introduces possibilities of confusion, this approach has been proven quite useful, and it is much less cumbersome than separating your documents according to what language they are written in.

By default, Recoll strips most accents and diacritics from terms, and converts them to lower case before either storing them in the index or searching for them. As a consequence, it is impossible to search for a particular capitalization of a term (US / us), or to discriminate two terms based on diacritics (sake / saké, mate / maté).

Recoll can optionally store the raw terms, without accent stripping or case conversion. In this configuration, default searches will behave as before, but it is possible to perform searches sensitive to case and diacritics. This is described in more detail in the section about index case and diacritics sensitivity.

Recoll uses many parameters to define exactly what to index, and how to classify and decode the source documents. These are kept in configuration files. A default configuration is copied into a standard location (usually something like /usr/share/recoll/examples) during installation. The default values set by the configuration files in this directory may be overridden by values set inside your personal configuration. With the default configuration, Recoll will index your home directory with generic parameters. Most common parameters can be set by using configuration menus in the recoll GUI. Some less common parameters can only be set by editing the text files.

The indexing process is started automatically (after asking permission), the first time you execute the recoll GUI. Indexing can also be performed by executing the recollindex command. Recoll indexing is multithreaded by default when appropriate hardware resources are available, and can perform multiple tasks in parallel for text extraction, segmentation and index updates.

Searches are usually performed inside the recoll GUI, which has many options to help you find what you are looking for. However, there are other ways to query the index: