The Recoll Python API gives easy access to the text contents of all the documents in the index, which makes it well suited for feeding a language model. There are two main directions for using this: Retrieval Augmented Generation (RAG), where the output of a generative LLM is informed by the result documents of another (probably keyword-based) search, and semantic queries, where we look for documents matching the concepts in the query rather than its exact terms.
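As an illustration, fetching the plain text of query results through the Python API might look like the following sketch. It assumes a Recoll installation providing the `recoll` Python package; the configuration directory and query string are placeholders, and `fetch_texts` is an invented helper name, not part of the distributed scripts.

```python
# Sketch: pull document text out of a Recoll index for use as
# language model input. The "recoll" package ships with Recoll
# itself and is not on PyPI, hence the guarded import.
try:
    from recoll import recoll, rclextract
except ImportError:
    recoll = rclextract = None

def fetch_texts(confdir, querystring, maxresults=10):
    """Return the plain text of the first results for a query string."""
    db = recoll.connect(confdir=confdir)
    query = db.query()
    query.execute(querystring)
    texts = []
    for _ in range(min(query.rowcount, maxresults)):
        doc = query.fetchone()
        # rclextract runs the appropriate input handler to produce text
        extractor = rclextract.Extractor(doc)
        fulldoc = extractor.textextract(doc.ipath)
        texts.append(fulldoc.text)
    return texts
```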

RAG operations are centered on the LLM interface and probably need little from Recoll beyond retrieving data from the index.

There is a new branch in the Recoll source with slight modifications allowing it to interface with a language model for the purpose of running semantic searches.

In practice, all the language-model-specific work is performed by Python programs outside of the main Recoll source. One of the scripts is a worker for the Recoll GUI, receiving questions and returning results. As a consequence, the main Recoll code needed only minimal modifications, mostly to handle the additional search type in the GUI.

ollama is used to run the models, and chromadb to store the embeddings.
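To show how the two pieces fit together, here is a minimal sketch using the `ollama` and `chromadb` Python packages. The collection name and the structure of the helper functions are invented for illustration; the actual scripts may organize this differently.

```python
# Sketch: embed text chunks with an ollama-served model and store
# them in a chromadb collection. Both packages are pip-installable,
# hence the guarded imports.
try:
    import ollama
    import chromadb
except ImportError:
    ollama = chromadb = None

def embed(text):
    # nomic-embed-text is the embedding model mentioned in this post
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def index_chunks(dbdir, chunks):
    """Store chunk embeddings in a persistent chromadb collection."""
    client = chromadb.PersistentClient(path=dbdir)
    coll = client.get_or_create_collection("recoll-sem")  # invented name
    coll.add(ids=[str(i) for i in range(len(chunks))],
             embeddings=[embed(c) for c in chunks],
             documents=chunks)
    return coll

def semantic_query(coll, question, n=5):
    # nearest-neighbour search over the stored embeddings
    return coll.query(query_embeddings=[embed(question)], n_results=n)
```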

The general process is as follows: at indexing time, the documents' text is retrieved from the Recoll index through the Python API, split into chunks, turned into embeddings by the model running under ollama, and stored in chromadb; at query time, the question is embedded in the same way and the closest stored chunks are returned as results.
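One small building block of such a pipeline, splitting extracted text into overlapping chunks before embedding, can be sketched as follows. The sizes are arbitrary illustrative defaults, not the values used by the actual scripts.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into overlapping character chunks for embedding.
    Overlap keeps sentences straddling a boundary searchable from
    both neighbouring chunks."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# A 2500-character text with the defaults yields chunks starting at
# offsets 0, 800, 1600 and 2400, i.e. four chunks.
parts = chunk_text("x" * 2500)
```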

The Python part runs in a virtual environment. A simple shell script allows building it easily.

I only have a modest PC to run this on, with no GPU. Running a reranking model on it has proven impossible (see the comments in rclsem_query.py), so the current implementation is rather primitive and mostly provides scaffolding for experimentation.

The modifications to the main Recoll GUI code are minimal, so they have been merged into the master branch (the code used to live on a semantic branch; the same code is now in master). In practice, if you want to try this:

This will install ollama if it is not already there, pull the nomic-embed-text model, create the Python virtual environment, install the ollama and chromadb Python packages into it, and copy the Recoll scripts and Python module.

There are a number of configuration variables which must be set in the main configuration file for the index you will be using:
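The fragment below only shows the general shape of such settings in the Recoll `name = value` configuration format. The variable names are invented placeholders, not the real ones: refer to the distributed scripts and sample configuration for the authoritative list.

```
# Hypothetical example only: the real variable names are documented
# with the scripts. Something along these lines would select the
# embedding model and the vector store location for the index.
llmembedmodel = nomic-embed-text
llmvectordir = /home/me/.recoll-sem/chroma
```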

Voilà. If you have a GPU and can code a little Python, I think the most interesting direction would be experimenting with reranking.
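For anyone who does want to experiment, reranking typically means retrieving a generous candidate set with the fast embedding search, then re-scoring the candidates with a slower, more accurate model. Here is a toy sketch of that control flow; the `word_overlap` scorer is a trivial stand-in for a real cross-encoder model, and the function names are invented.

```python
def rerank(question, candidates, score, keep=3):
    """Re-order candidate chunks with a (slow) scoring function and
    keep the best ones. 'score' stands in for a reranking model."""
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ranked[:keep]

def word_overlap(question, chunk):
    # Trivial stand-in scorer: count shared words. A real reranker
    # would run a cross-encoder over the question/chunk pair.
    return len(set(question.lower().split()) & set(chunk.lower().split()))

best = rerank("riding horses",
              ["a chapter about horses and riding",
               "a letter about the weather",
               "dancing at the ball"],
              word_overlap, keep=1)
# best is the chunk sharing the most words with the question
```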

As a vaguely interesting aside, from my initial testing on the works of Jane Austen, it appears that horseriding is a much more frequent subject in Mansfield Park than in the other books… It's all the more interesting because "horseriding" is of course not actually an English word, so a normal Recoll search for it would yield no results at all. It remains to be seen whether the feature can be useful for anything beyond literary 'research'.