Recoll for Chinese

Overview

By default, Recoll uses an n-gram text tokenizer to process Chinese. This generates terms from single characters and from consecutive sequences of 2 characters (by default), and it produces a lot of meaningless terms.
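To make the n-gram behaviour concrete, here is a minimal Python sketch of the idea. It is an illustration only, not Recoll's actual tokenizer code: the function name `ngram_terms` and the sample string are made up for this example.

```python
def ngram_terms(text, max_n=2):
    """Return all character n-grams of length 1..max_n, shortest first.

    Simplified illustration of n-gram term generation; Recoll's real
    tokenizer is implemented differently.
    """
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            terms.append(text[i:i + n])
    return terms


sample = "我爱北京"  # "I love Beijing"
print(ngram_terms(sample))
# → ['我', '爱', '北', '京', '我爱', '爱北', '北京']
```

Note the term 爱北, which straddles two words and means nothing on its own. A word segmenter would instead be expected to emit only real words for this input.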

Recoll 1.37 and later can use the Jieba Chinese text segmentation software instead. This results in a smaller index and better search terms.

The necessary modules are not bundled with the base Recoll installation. You need to install the Jieba Python3 extension and add the following line to the configuration file:

chinesetagger = Jieba

Slow performance

The Jieba-based tokenizer is much slower than the "stupid" one based on n-grams, both because it has more work to do and because this work is performed in a separate Python process (implying a large number of process switches during indexing). The performance remains reasonable on Linux (especially in the default multithreaded mode), but it becomes a major issue on Windows.

The following are the indexing times for around 6 MB of Chinese epubs (no images) downloaded from gutenberg.org. The indexing was performed on a Lenovo T580 laptop (Intel Core i7-8550U). The Windows version was running in a VirtualBox virtual machine.

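| Platform | Threading       | Tokenizer | Indexing time |
|----------|-----------------|-----------|---------------|
| Linux    | multi-threaded  | n-grams   | 8 seconds     |
| Linux    | multi-threaded  | Jieba     | 13 seconds    |
| Linux    | single-threaded | n-grams   | 12 seconds    |
| Linux    | single-threaded | Jieba     | 30 seconds    |
| Windows  | single-threaded | n-grams   | 29 seconds    |
| Windows  | single-threaded | Jieba     | 92 seconds    |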

I am not sure why the indexing is so slow on Windows. It may be because of VirtualBox, but I doubt it. In any case, a comparison on a "real" machine would be interesting.
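For a quick sense of the relative cost, the timings listed above can be turned into Jieba-to-n-gram slowdown ratios. The short Python sketch below uses only the measured figures from this page; the dictionary layout and variable names are just for illustration.

```python
# Indexing times in seconds, copied from the measurements above:
# (platform, threading) -> (n-grams, Jieba)
timings = {
    ("Linux", "multi-threaded"): (8, 13),
    ("Linux", "single-threaded"): (12, 30),
    ("Windows", "single-threaded"): (29, 92),
}

# Slowdown factor of Jieba relative to n-grams for each configuration
ratios = {key: jieba / ngrams for key, (ngrams, jieba) in timings.items()}

for (platform, threading), ratio in ratios.items():
    print(f"{platform}, {threading}: Jieba is {ratio:.2f}x slower")
```

On this single small data set, Jieba costs roughly 1.6 times the n-gram indexing time on multi-threaded Linux, 2.5 times on single-threaded Linux, and a little over 3 times on Windows.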

Installing Jieba for Recoll on Windows

Install Jieba: as of Recoll 1.37.0, the embedded Python distribution which comes with Recoll includes the pip module. Use the following in a command window to install Jieba:
```
"c:\Program Files (x86)\Recoll\Share\filters\python\python.exe" -m pip install jieba
```
Edit the Recoll index configuration file (default: C:\Users\[me]\Appdata\Local\Recoll\recoll.conf) with, e.g., Notepad, and add the following line:

chinesetagger = Jieba

Reset the index.

Installing Jieba on Linux

Depending on your distribution, the Python3 Jieba interface may or may not be packaged. On Debian, the package name is python3-jieba. You can install the module either through the system package manager or with pip3, in personal or system-wide mode; the choice is really a matter of taste and circumstances. For example, to install system-wide with pip:

sudo python3 -m pip install jieba

Edit recoll.conf and add the following line:

chinesetagger = Jieba

Reset the index (for example, by running recollindex -z).
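If indexing does not seem to pick up Jieba, it can help to check that the module is importable by the Python interpreter in question. The following is a minimal sketch, not part of Recoll: the helper name `jieba_available` and the sample sentence are made up for this example. On Windows, it should be run with the embedded python.exe shown in the installation section, since that is where pip installed the module.

```python
import importlib.util


def jieba_available():
    """Return True if the jieba module can be found by this interpreter."""
    return importlib.util.find_spec("jieba") is not None


if jieba_available():
    import jieba

    # jieba.cut returns a generator of word segments
    print(list(jieba.cut("我爱北京天安门")))
else:
    print("jieba is not installed for this Python interpreter")
```

If the second message is printed, the module was probably installed for a different Python interpreter than the one being tested.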