Using OpenAI Whisper to transcribe speech to text

At some point between releases 1.34.2 and 1.35, the Recoll audio input handler gained the ability to use the OpenAI Whisper program to transcribe speech to text. The transcribed text is then extracted as the document's main text and indexed.

The feature reuses the Recoll OCR cache (now somewhat misnamed), so that the transcription runs only once, even if you reset the index or move the files around.

To enable the feature:

(The example commands are for Ubuntu; adapt them as needed for other systems.)

  • Install ffmpeg:

    sudo apt install ffmpeg
  • Install PyTorch:

    pip3 install torch
  • Install OpenAI Whisper:

    pip3 install git+
  • Add the following to recoll.conf:

    speechtotext = whisper
    sttmodel = small
    #sttdevice =
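Once the pieces are installed, a quick sanity check can confirm they are all present. This is a hypothetical helper, not part of Recoll; it just verifies the ffmpeg binary and the torch and whisper Python modules that the steps above install:

```python
import shutil


def check_prereqs():
    """Return the list of missing prerequisites for Whisper transcription.

    Hypothetical helper: checks for the ffmpeg executable on the PATH and
    for the torch and whisper Python modules installed above.
    """
    missing = []
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg")
    for mod in ("torch", "whisper"):
        try:
            __import__(mod)
        except ImportError:
            missing.append(mod)
    return missing


print(check_prereqs())  # an empty list means everything is in place
```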

Set a value for sttdevice if you have a suitable graphics card; otherwise Whisper will run on the CPU.
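If you are unsure whether a usable GPU is present, a small sketch like the following can help decide what to set (assuming sttdevice accepts a PyTorch device string such as cuda or cpu):

```python
def pick_device():
    """Return "cuda" when a CUDA-capable GPU is visible to PyTorch, else "cpu".

    Assumption: sttdevice accepts a PyTorch device string.
    """
    try:
        import torch  # imported lazily so the sketch runs even without PyTorch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"


print(pick_device())
```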

You may want to check that things work by running whisper from the command line:

whisper --language=en --model=small /some/audio/file.mp3
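The same check can be done from Python. This is a minimal sketch using the whisper module's documented API; the function name is ours, and the model name mirrors the sttmodel value above:

```python
def transcribe(path, model_name="small", language="en"):
    """Transcribe an audio file with OpenAI Whisper and return the text.

    Requires the whisper package installed as described above; the import
    is done lazily so the sketch itself loads without it.
    """
    import whisper
    model = whisper.load_model(model_name)  # e.g. "small", as in recoll.conf
    result = model.transcribe(path, language=language)
    return result["text"]
```

For example, transcribe("/some/audio/file.mp3") should return the spoken text of that file.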

You can then index away. Maybe try it on a small subset first…​