Abstract

Introduction

The Recoll indexer, recollindex, is a large process which executes many other programs, mostly to extract text from documents. Some of the executed processes are quite short-lived, and the time used by the process execution machinery can actually dominate the time spent translating the data. This document explores possible approaches to improving performance without adding excessive complexity or compromising reliability.

Studying fork/exec performance is not exactly a new venture, and there are many texts which address the subject. While researching, though, I found that quite a few were inaccurate and that a lot of questions were left as an exercise for the reader.

Issues with fork

The traditional way for a Unix process to start another is the fork()/exec() system call pair.

fork() duplicates the process address space and resources (open files etc.), then duplicates the thread of execution, ending up with two mostly identical processes.

exec() then replaces part of the newly executing process with an address space initialized from an executable file, inheriting some of the resources under various conditions.
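
As a reminder, the classical sequence looks like the following sketch (a minimal example, not Recoll code; the command and file names are just placeholders):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* Child: replace the duplicated address space with a new program. */
            execlp("antiword", "antiword", "some.doc", (char *)NULL);
            /* Only reached if the exec failed (e.g. command not found). */
            _exit(127);
        }
        /* Parent: wait for the child to terminate. */
        int status;
        waitpid(pid, &status, 0);
        return 0;
    }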

This was all fine with the small processes of the first Unix systems, but as time progressed, processes became bigger and the copy-before-discard operation was found to waste significant resources. It was optimized using two methods (at very different points in time):

  • The first approach was to supplement fork() with the vfork() call, which is similar but does not duplicate the address space: the new process thread executes in the old address space. The old thread is blocked until the new one calls exec() and frees up access to the memory space. Any modification performed by the child thread persists when the old one resumes (a minimal vfork() sketch follows this list).

  • The more modern approach, which coexists with vfork(), was to replace the full duplication of the memory space with duplication of the page descriptors only. The pages in the new process are marked copy-on-write so that the new process has write access to its memory without disturbing its parent. This approach was supposed to make vfork() obsolete, but the operation can still be a significant resource consumer for big processes mapping a lot of memory, so that vfork() is still around. Programs can have big memory spaces not only because they have huge data segments (rare), but just because they are linked to many shared libraries (more common).
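
For illustration, here is a minimal vfork()-based sketch (a hypothetical helper, not actual Recoll code). Until the child calls exec() or _exit(), it runs inside the parent's address space, so it must modify as little as possible and must not return from the function which called vfork():

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical helper: start 'cmd' with vfork()/exec(). */
    pid_t start_command(const char *cmd, char *const argv[])
    {
        pid_t pid = vfork();
        if (pid == 0) {
            /* Child: the parent is suspended until we exec or exit. */
            execvp(cmd, argv);
            _exit(127);   /* exec failed: leave without touching parent data */
        }
        /* Parent: resumes here once the child has called exec() or _exit(). */
        return pid;
    }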

Note
Orders of magnitude: a recollindex process will easily grow to a few hundred megabytes of virtual address space. It executes the small and efficient antiword command to extract text from MS Word files. While indexing many such files, recollindex can spend 60% of its CPU time doing fork()/exec() housekeeping instead of useful work (this is on Linux, where fork() uses copy-on-write).

Apart from the performance cost, another issue with fork() is that a big process can fail to execute a small command because of the temporary need to allocate twice its address space. This is a much-discussed subject which we will leave aside because it generally does not concern recollindex: in typical conditions it uses a small portion of the machine's virtual memory, so a temporary doubling is not an issue.

The Recoll indexer is multithreaded, which may introduce other issues. Here is what happens to threads during the fork()/exec() interval:

  • fork():

    • The parent process threads all go on their merry way.

    • The child process is created with only one thread active, duplicated from the one which called fork().

  • vfork()

    • The parent process thread calling vfork() is suspended, the others are unaffected.

    • The child is created with only one thread, as for fork(). This thread shares the memory space with the parent's threads, without any means to synchronize with them (pthread locks are not supposed to work across processes): caution needed!

Note
For a multithreaded program using the classical pipe method to communicate with children, the sequence between the pipe() call and the parent's close() of the unused end is a candidate for a critical section: if several threads can interleave in there, child processes may inherit descriptors which 'belong' to other fork()/exec() operations. Whether this is a problem depends on how descriptor cleanup is performed in the child (if no cleanup is performed, pipes may remain open at both ends, which will prevent EOF from ever being seen, etc.). Thanks to StackExchange user Celada for explaining this to me.
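
For illustration, here is a sketch of one way to protect that window, serializing the whole pipe()/fork()/close() sequence with a mutex (a hypothetical helper, not actual Recoll code; on Linux, pipe2() with O_CLOEXEC is another way to narrow the window):

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pthread_mutex_t exec_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Start a command with its stdout connected to a pipe. The mutex ensures
     * that no other thread forks while our pipe descriptors are not yet
     * closed in the parent, so children of other threads cannot inherit them. */
    pid_t start_filter(char *const argv[], int *outfd)
    {
        int fds[2];
        pid_t pid;

        pthread_mutex_lock(&exec_mutex);
        if (pipe(fds) < 0) {
            pthread_mutex_unlock(&exec_mutex);
            return -1;
        }
        pid = fork();
        if (pid < 0) {
            close(fds[0]);
            close(fds[1]);
            pthread_mutex_unlock(&exec_mutex);
            return -1;
        }
        if (pid == 0) {
            /* Child: the write end becomes stdout, both pipe fds are closed. */
            dup2(fds[1], 1);
            close(fds[0]);
            close(fds[1]);
            execvp(argv[0], argv);
            _exit(127);
        }
        /* Parent: close the unused write end before releasing the lock. */
        close(fds[1]);
        pthread_mutex_unlock(&exec_mutex);
        *outfd = fds[0];
        return pid;
    }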

For multithreaded programs, both fork() and vfork() introduce possibilities of deadlock, because the resources held by a non-forking thread in the parent process can’t be released in the child because the thread is not duplicated. This used to happen from time to time in recollindex because of an error logging call performed if the exec() failed after the fork() (e.g. command not found).

With vfork() it is also possible to trigger a deadlock in the parent by (inadvertently) modifying data in the child. This could happen just because of dynamic linker operation (which, seriously, should be considered a system bug).

In general, the state of program data in the child process is a semi-random snapshot of what it was in the parent, and the official word about what you can do is that you may only call async-signal-safe library functions between fork() and exec(). These are functions which are safe to call from a signal handler because they are either reentrant or cannot be interrupted by a signal. A notable absentee from the list is malloc().
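
In practice this means the child should limit itself to a handful of plain system calls between fork() and exec(). For example, an exec() failure can be reported with write(), which is async-signal-safe, instead of going through a logging layer which may allocate memory or take locks (a sketch, not the actual Recoll code):

    #include <unistd.h>

    /* Code running in the child right after fork(): async-signal-safe calls only. */
    static void child_exec(char *const argv[])
    {
        execvp(argv[0], argv);
        /* exec failed: report it with write(), then _exit() without running
         * atexit() handlers or flushing stdio buffers. */
        static const char msg[] = "exec failed\n";
        write(2, msg, sizeof(msg) - 1);
        _exit(127);
    }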

These are normally not issues for programs which only fork to execute another program (but the devil is in the details as demonstrated by the logging call issue…​).

One of the approaches often proposed for working around this minefield is to use a small auxiliary process to execute any command needed by the main one. The small process can just use fork()/exec() with no performance issue. The inconvenience is that communication becomes more complicated if a lot of data needs to be transferred one way or the other.

The posix_spawn() Linux non-event

Given the performance issues of fork() and the tricky behaviour of vfork(), a "simpler" method for starting a child process was introduced by Posix: posix_spawn().

The posix_spawn() function is a black box, externally equivalent to a fork()/exec() sequence, and has parameters to specify the usual house-keeping performed at this time (file descriptors and signals management etc.). Hiding the internals gives the system a chance to optimize the performance and avoid vfork() pitfalls like the ld.so lockup described in the Oracle article.
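
For reference, the call looks like the following sketch, where the file actions list describes the descriptor housekeeping which would otherwise be performed by hand between fork() and exec() (a minimal example with a placeholder redirection, not Recoll code):

    #include <fcntl.h>
    #include <spawn.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int run_command(const char *path, char *const argv[])
    {
        posix_spawn_file_actions_t fa;
        pid_t pid;
        int status;

        posix_spawn_file_actions_init(&fa);
        /* Example of housekeeping done on our behalf: redirect the child's stdout. */
        posix_spawn_file_actions_addopen(&fa, 1, "/dev/null", O_WRONLY, 0);

        int err = posix_spawn(&pid, path, &fa, NULL, argv, environ);
        posix_spawn_file_actions_destroy(&fa);
        if (err != 0)
            return -1;
        waitpid(pid, &status, 0);
        return status;
    }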

The Linux posix_spawn() is implemented by a fork()/exec() pair by default.

vfork() is used either if specified by an input flag or if no signal/scheduler/process group changes are requested. There must be a reason why signal handling changes would preclude vfork() usage, but I could not find it (the signal handling data is stored in the kernel task_struct).

The Linux glibc posix_spawn() currently does nothing that user code could not do. Still, using it would probably be a good future-proofing idea, were it not for a significant problem: there is no way to ask it to close all open descriptors above a specified value (a closefrom() equivalent). This exists on Solaris and is in fact quite necessary, because we have no way to be sure that all open descriptors have the close-on-exec flag set.

So, no posix_spawn() for us (support was implemented inside recollindex, but the code is normally not used).

The chosen solution

The previous version of recollindex used vfork() when it was running a single thread, and fork() when it ran multiple ones.

After another careful look at the code, I could see few issues with using vfork() in the multithreaded indexer, so this was committed.

The only change necessary was to get rid of an implementation of closefrom(), which Linux lacks (it is used to close all open descriptors above a given value). The previous Recoll implementation listed the /proc/self/fd directory to look for open descriptors, but this was unsafe because of possible memory allocations in opendir() etc.
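
One vfork()-safe alternative is a brute-force loop which only issues close() system calls, along these lines (a sketch under the stated assumptions, not necessarily what Recoll ended up using):

    #include <sys/resource.h>
    #include <unistd.h>

    /* Close all descriptors >= from without reading /proc or allocating
     * memory, so that it can run between (v)fork() and exec(). */
    static void close_from(int from)
    {
        struct rlimit rl;
        int maxfd = 1024;    /* arbitrary fallback if the limit is unknown */

        if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
            maxfd = (int)rl.rlim_cur;
        for (int fd = from; fd < maxfd; fd++)
            close(fd);
    }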

Another evolution of Recoll now helps by mostly avoiding the issue: the original version always executed short-lived external commands to extract data. This was changed to mostly using (almost) permanent Python processes which communicate with the main one. Depending on the type of document being processed, the Python code either executes an external command (not a problem because the Python process is reasonably small), or processes the file itself if the format makes that possible.

Test results

Table 1. Indexing 12500 small .doc files

  call          real        user        sys
  fork          0m46.025s   0m26.574s   0m39.494s
  vfork         0m18.223s   0m17.753s   0m1.736s
  spawn/fork    0m45.726s   0m27.082s   0m40.575s
  spawn/vfork   0m18.915s   0m18.681s   0m3.828s
  recoll 1.18   1m47.589s   0m21.537s   0m29.458s

No surprise here: given the implementation of posix_spawn(), it gets the same times as the corresponding fork()/vfork() options.

The tests were performed on an Intel Core i5 750 (4 cores, 4 threads), around 2015.

It would be painful to play it safe and discard the 60% reduction in execution time offered by vfork(), so it was adopted for Recoll 1.21. To this day no problems have been discovered, but fingers remain crossed…

The last line in the table is just for fun: recollindex 1.18 (single-threaded) needed almost 6 times as long to process the same files…