Special considerations for big indexes

You only need to be concerned with this if your index is going to be bigger than around 5 GBytes. Beyond 10 GBytes, it becomes a serious issue. Most people have much smaller indexes. For reference, 5 GBytes would be around 2000 bibles, a lot of text. If you have a huge text dataset (remember: images don't count, and the text content of PDFs is typically less than 5% of the file size), read on.

The amount of writing performed by Xapian during index creation does not grow linearly with the index size (the growth is somewhere between linear and quadratic). For big indexes this becomes a performance issue, and may even become an SSD wear issue.
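To give a feel for what "between linear and quadratic" means in practice, here is a minimal sketch which assumes that the write volume follows a simple power law, writes ~ size**exponent. The power-law form, the function name and the sample exponent 1.5 are illustrative assumptions, not measured Xapian behaviour; only the bounds 1.0 (linear) and 2.0 (quadratic) come from the text above.

```python
def growth_factor(size_ratio: float, exponent: float) -> float:
    """Factor by which the write volume grows when the index size is
    multiplied by size_ratio, under the ASSUMED model writes ~ size**exponent.
    """
    return size_ratio ** exponent


if __name__ == "__main__":
    # 1.0 = linear bound, 2.0 = quadratic bound, 1.5 = arbitrary midpoint.
    for exponent in (1.0, 1.5, 2.0):
        factor = growth_factor(2.0, exponent)
        print(f"exponent {exponent}: doubling the index size "
              f"multiplies writes by {factor:.2f}")
```

Under this model, doubling the index size multiplies the amount of writing by somewhere between 2 and 4, which is why the cost climbs quickly once the index gets big.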

The problem can be mitigated in the following ways:

  • Partition the data set and create several indexes of reasonable size rather than a huge one. These indexes can then be queried in parallel (using the Recoll external indexes facility), or merged using xapian-compact.

  • Have a lot of RAM available and set the idxflushmb Recoll configuration parameter (a value in megabytes) as high as you can without swapping (experimentation will be needed). 200 would be a minimum in this context.

  • Use Xapian 1.4.10 or newer, as this version significantly reduced the amount of writing.
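The partitioning approach from the first item can be sketched as a small script which only builds the command lines, without running them. It assumes one Recoll configuration directory per partition, each indexed with "recollindex -c", and that each index lives in the default xapiandb subdirectory of its configuration directory. The directory names and both helper functions are hypothetical, and a non-default dbdir setting would change the source paths.

```python
from pathlib import PurePosixPath


def index_commands(config_dirs):
    """One 'recollindex -c <confdir>' command per partition."""
    return [["recollindex", "-c", str(PurePosixPath(d))] for d in config_dirs]


def merge_command(config_dirs, destination):
    """xapian-compact takes the source databases first, the destination last.

    ASSUMPTION: each index is in the default 'xapiandb' subdirectory
    of its configuration directory.
    """
    sources = [str(PurePosixPath(d) / "xapiandb") for d in config_dirs]
    return ["xapian-compact", *sources, str(PurePosixPath(destination))]


if __name__ == "__main__":
    # Hypothetical layout: three partitions, each with its own configuration.
    parts = ["/data/recoll/part1", "/data/recoll/part2", "/data/recoll/part3"]
    for cmd in index_commands(parts):
        print(" ".join(cmd))
    print(" ".join(merge_command(parts, "/data/recoll/merged/xapiandb")))
```

The printed commands can be reviewed and run by hand. The merge step is optional: the partial indexes can also be left separate and queried together as external indexes.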

Recoll versions 1.38 and newer have an option to use multiple temporary indexes and a final merge internally. This may be a simple solution for the index size issue, but it may not provide enough control over the physical placement of the temporary indexes for really huge datasets.