Multithreading for document preparation

The Recoll indexing process recollindex can use multithreading to speed up indexing on multiprocessor systems. This is currently enabled on macOS and Unix-like systems, but not under Windows.

The data processing used to index files is divided into several stages, some of which can be executed by multiple threads. The stages are:

  1. File system walking: this is always performed by the main thread.

  2. File conversion and data extraction.

  3. Text processing (splitting, stemming, etc.).

  4. Xapian index update.

You can also read a longer document about the transformation of Recoll indexing to multithreading.
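The staged layout above can be illustrated with a small Python model. This is only a sketch of the general technique (worker threads connected by bounded queues), not Recoll's actual implementation: the names run_stage and index_files, the sentinel object, and the stand-in work done in each stage are all hypothetical.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel telling a worker to exit


def run_stage(func, inq, outq, nthreads):
    """Start nthreads workers applying func to items read from inq."""
    def worker():
        while True:
            item = inq.get()
            if item is _STOP:
                break
            result = func(item)
            if outq is not None:
                outq.put(result)  # blocks while the next queue is full

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    return threads


def index_files(paths, depth=2, counts=(4, 2, 1)):
    """Toy model of the four stages; returns the 'index' as a sorted list."""
    index = []
    q2, q3, q4 = (queue.Queue(maxsize=depth) for _ in range(3))

    def split(text):
        return text.split("/")  # stand-in for stage 3 text processing

    s2 = run_stage(str.upper, q2, q3, counts[0])   # stand-in for stage 2
    s3 = run_stage(split, q3, q4, counts[1])
    s4 = run_stage(index.append, q4, None, counts[2])  # stage 4: one thread

    for p in paths:  # stage 1: the main thread feeds the first queue
        q2.put(p)

    # Shut the stages down in order, so each one drains before the next stops.
    for stage, q in ((s2, q2), (s3, q3), (s4, q4)):
        for _ in stage:
            q.put(_STOP)
        for t in stage:
            t.join()
    return sorted(index)
```

The bounded queues (maxsize=depth) make a fast stage wait for a slow one instead of accumulating unlimited pending work.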

The thread configuration is controlled by two configuration file parameters.

thrQSizes

This variable defines the configuration of the job input queues. There are three possible queues, for stages 2, 3 and 4, and this parameter should give the queue depth for each stage (three integer values). If a value of -1 is used for a given stage, no queue is used, and the thread handling the previous stage goes on to perform this stage as well. In practice, deep queues have not been shown to increase performance. A value of 0 for the first queue tells Recoll to perform autoconfiguration: nothing else needs to be set in this case, and thrTCounts is not used. This is the default configuration.

thrTCounts

This defines the number of threads used for each stage. If a value of -1 is used for one of the queue depths, the corresponding thread count is ignored. It makes no sense to use a value other than 1 for the last stage because updating the Xapian index is necessarily single-threaded (and protected by a mutex).

Note

If the first value in thrQSizes is 0, thrTCounts is ignored.
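The rules for the two parameters can be summarized in a short Python sketch. It models only what this section states and is not Recoll's configuration parser: the function names, the returned data shape, and the default of one thread per stage when no thrTCounts value is given are assumptions made for illustration.

```python
STAGES = ("conversion", "text processing", "index update")


def parse_ints(value):
    """Split a whitespace-separated parameter value into integers."""
    return [int(tok) for tok in value.split()]


def describe_threads(thr_q_sizes, thr_t_counts=""):
    """Return 'auto', or a dict mapping each stage to (depth, threads) or None."""
    depths = parse_ints(thr_q_sizes)
    if depths and depths[0] == 0:
        return "auto"  # autoconfiguration: thrTCounts is ignored
    if len(depths) != 3:
        raise ValueError("thrQSizes needs three integer values")
    # Assumption for this sketch: one thread per stage if no counts are given.
    counts = parse_ints(thr_t_counts) if thr_t_counts.strip() else [1, 1, 1]
    if len(counts) != 3:
        raise ValueError("thrTCounts needs three integer values")
    plan = {}
    for stage, depth, count in zip(STAGES, depths, counts):
        # A depth of -1 means no queue; the matching thread count is ignored.
        plan[stage] = None if depth == -1 else (depth, count)
    return plan
```

For example, describe_threads("2 -1 -1", "6 1 1") reports a queue only for the conversion stage, with the last two thread counts ignored.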

The following example would use three queues (of depth 2), with 4 threads for converting source documents, 2 for processing their text, and one to update the index. In testing, this was the best configuration on the test system (a quad-processor machine with multiple disks).

            thrQSizes = 2 2 2
            thrTCounts = 4 2 1

The following example would use a single queue, and the complete processing of each document would be performed by a single thread (several documents will still be processed in parallel in most cases). The threads will use mutual exclusion when entering the index update stage. In practice, performance would generally be close to that of the previous case, but worse in certain cases (e.g. a Zip archive would be processed purely sequentially), so the previous approach is preferred. Your mileage may vary. The last two values for thrTCounts are ignored.

            thrQSizes = 2 -1 -1
            thrTCounts = 6 1 1

The following example would disable multithreading. Indexing will be performed by a single thread.

            thrQSizes = -1 -1 -1