Using multiple temporary indexes to improve indexing time

In some cases, either when the input documents are simple and require little processing (e.g. HTML files), or when many cores are available, the single-threaded Xapian index updates can become the performance bottleneck for indexing.

In this case, the indexer (Recoll 1.38 and later) can be configured to use multiple temporary indexes, which are merged at the end of the operation. This can provide a large performance gain but, unlike multithreading for document preparation, it can also have a (slight) negative impact in some cases, so it is not enabled by default.

The parameter which controls the number of temporary indexes in recoll.conf is named thrTmpDbCnt. The default value is 0, meaning that no temporary indexes are used.

If your document set is big and you are indexing on a processor with many cores, especially if the input documents are simple, it may be worth experimenting with the value. For example, with a partial Wikipedia dump (many small HTML files), using four temporary indexes on a quad-core machine divided indexing times by almost three. More detail in this article on the Recoll Web site.
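
As an illustration, the quad-core test setup mentioned above might correspond to a recoll.conf entry like the following. This is only a sketch: the value 4 is taken from that example and is not a general recommendation, and the best value for a given machine and document set should be found by experiment.

```
# Illustrative value only: four temporary indexes, merged at the end of indexing.
# The default is 0 (no temporary indexes).
thrTmpDbCnt = 4
```

Setting the value back to 0 (or removing the line) restores the default behaviour.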

All the tests were performed on SSDs. It is quite probable that this approach would not work well on spinning disks, at least not in its current form.