This need only concern you if your index is going to be bigger than around 10 GB. Most people have much smaller indexes. For reference, a 10 GB index would represent around 4000 bibles, which is a lot of text. If you have a huge text data set (remember: images don't count, and the text content of PDF files is typically less than 5% of their size), read on.
Recoll (thanks to Xapian) can manage huge indexes: in 2025, we heard of a 550 GB index holding more than 11 million documents. Big indexes just need a bit of thinking ahead and organisation (and appropriate hardware).
The amount of data written by Xapian during index creation does not grow linearly with the index size (it grows somewhere between linearly and quadratically). For big indexes this becomes a performance issue, and may even become an SSD wear issue.
The problem can be mitigated by using the following approaches:
- Partition the data set and create several smaller indexes rather than one huge one. These indexes can then be queried in parallel (using the Recoll external indexes facility), or merged with xapian-compact (see the sketch after this list).
- Have a lot of RAM available and set the idxflushmb Recoll configuration parameter as high as you can without swapping (experimentation will be needed). 200 would be a bare minimum in this context (see the configuration example after this list).
- Use Xapian 1.4.10 or newer, as this version brought a significant reduction in the amount of data written during indexing.
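As an illustration of the partitioning approach, here is a minimal sketch using the Recoll Python programming interface to search two partitions as a single index. The configuration directory paths are hypothetical; each partition is an ordinary Recoll configuration directory created by indexing a subset of the data set.

    # Minimal sketch, assuming two hypothetical partition configuration
    # directories, each built by a separate recollindex run.
    from recoll import recoll

    # Open the first partition and add the second one as an external
    # index (extra_dbs takes paths to the Xapian database directories).
    db = recoll.connect(confdir="/home/me/.recoll-part1",
                        extra_dbs=["/home/me/.recoll-part2/xapiandb"])
    query = db.query()
    nres = query.execute("some search terms")
    print("Result count: %d" % nres)
    for i in range(min(nres, 20)):
        doc = query.fetchone()
        print("%s %s" % (doc.url, doc.title))

Alternatively, the partitions can be merged into a single index with xapian-compact, for example: xapian-compact /home/me/.recoll-part1/xapiandb /home/me/.recoll-part2/xapiandb /home/me/mergeddb (source databases first, destination last; the paths are again hypothetical).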
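For reference, the flush threshold is set in the recoll.conf file at the top of the index configuration directory. The value below is only an illustrative order of magnitude for a machine with abundant RAM; the parameter counts megabytes of indexed text, and the right value has to be found by experimentation.

    # Fragment of ~/.recoll/recoll.conf (illustrative value only).
    # Flush the index to disk only after this many megabytes of text
    # have been processed: fewer, bigger flushes mean less rewriting.
    idxflushmb = 1000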
Recoll versions 1.38 and newer can optionally index to multiple temporary indexes and merge them at the end. This was designed as a CPU performance optimisation (it increases parallelism), but it may also provide a simple solution to the index size issue, though it may not give enough control over the temporary indexes' physical placement for really huge data sets.