Character case and diacritic marks (2), user interface

In a previous document, we discussed some of the problems which arise when mixing case/diacritics sensitivity and stemming.

As of version 1.18, Recoll can create two types of indexes:

* Dumb indexes contain terms which are lowercased and stripped of diacritics. Searches using such an index are naturally case- and diacritics-insensitive: search terms are stripped before processing.
* Raw indexes contain terms which are just like they were found in the source document. Searching such an index is naturally sensitive to case and diacritics, and can be made insensitive by further processing.
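
The difference between the two index types can be illustrated with a minimal sketch. This is not Recoll's own unaccenting code (which is not shown here): it approximates "stripping" with Unicode decomposition from the Python standard library, and the names strip_term and index_term are hypothetical.

```python
import unicodedata


def strip_term(term: str) -> str:
    """Lowercase a term and drop its combining diacritical marks.

    Approximation only: characters which have no canonical decomposition
    (for example "ø") are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", term)
    base_chars = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base_chars).lower()


def index_term(term: str, strip_chars: bool) -> str:
    """Return the term as it would be stored in a dumb (stripped) or raw index."""
    return strip_term(term) if strip_chars else term


if __name__ == "__main__":
    for word in ("Éléphant", "RÉSUMÉ", "plain"):
        print(word, "->", index_term(word, True), "|", index_term(word, False))
```

With a dumb index, the same strip_term step is applied to the search terms, which is why such searches are insensitive without any further work.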

The following explains how users can control these Recoll features.

Controlling the type of index we create: stripped or raw

The kind of index that recoll creates is determined by:

A build-time configure switch: --enable-stripchars. If this is set, the code for case and diacritics sensitivity is not compiled in and recoll will work like the previous versions: unaccented and casefolded index, no runtime options for case or diacritics sensitivity.
An indexing configuration switch (in recoll.conf): if Recoll was built with --disable-stripchars, this provides a dynamic way to return to the "traditional" index. The case and diacritics code will be present but inactive. Normally, a recoll installation with this switch set should behave exactly like one built with --enable-stripchars. When using multiple indexes, this switch MUST be consistent between indexes. There is no support whatsoever for mixing raw and dumb indexes. The option is named indexStripChars, and, to avoid errors, it is not settable from the GUI. This is something that would typically be set once and for all for a given installation. We need to decide what the default value will be for 1.18.
A number of query-time switches. Using these, it is also possible to perform a search insensitive to case and diacritics on a raw index. Note, however, that given the complexity of the issues involved, I give no guarantee at this time that this will yield exactly the same results as searching a dumb index. Details about query-time behaviour follow.
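
Because mixing raw and dumb indexes is not supported, it can be useful to check that every configuration involved in a multi-index search agrees on indexStripChars. The following is a rough sketch of such a check, not a Recoll tool. It assumes that recoll.conf holds simple "name = value" lines and that the switch takes a 0/1 value; both are assumptions, and the function names are hypothetical.

```python
def read_strip_chars(conf_text: str):
    """Return the last indexStripChars value found in conf_text, or None if unset."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, val = line.partition("=")
        if sep and name.strip() == "indexStripChars":
            value = val.strip()
    return value


def strip_chars_consistent(conf_texts) -> bool:
    """True if all configurations carry the same indexStripChars setting.

    An unset value is treated as different from any explicit value, because
    the default is not decided yet: this is deliberately conservative.
    """
    return len({read_strip_chars(text) for text in conf_texts}) <= 1


if __name__ == "__main__":
    main_conf = "# main index\nindexStripChars = 1\n"
    extra_conf = "indexStripChars = 0\n"
    print(strip_chars_consistent([main_conf, main_conf]))
    print(strip_chars_consistent([main_conf, extra_conf]))
```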

Controlling stem, case and diacritics expansion: user query interface

Recoll versions up to 1.17 were insensitive to case and diacritics. We only needed to give the user a way to control stem expansion. This was done in three ways:

Globally, by setting a menu option.
Globally, by setting the stemming language value to empty.
On a term-by-term basis, by capitalizing the term, or, in query language mode only, by using an 'l' clause modifier ("term"l).

After switching to an unstripped index, capable of case and diacritic sensitivity, we need ways to control what processing is performed among:

Case expansion.
Diacritics expansion.
Stem expansion.

The default mode will be compatible with the previous version, because this is most generally what we want to do: ignore case and diacritics, expand stems.

There are two easy approaches for controlling the parameters:

* Global options set in the GUI menus or as recollq command line switches.
* Per-clause options set by modifiers in the query language.

We would like, however, to let the user's entry automatically override the defaults in a sensible way. For example:

If a term is entered with diacritics, diacritic sensitivity is turned on (for this term only).
If a term is entered with upper-case characters, case sensitivity is turned on. In this case, we turn off stem expansion, because it really makes no sense with case sensitivity.

With this method, we are stuck with three problems (only if the global mode is set to insensitive, and we’re not using the query language):

Turning off stemming without turning on case sensitivity.
Searching for an all lower-case term in case-sensitive mode.
Searching for a term without diacritics in diacritic-sensitive mode.

The latter two issues are relatively marginal and can be worked around easily by switching to query language mode or using negative clauses in the advanced search.

However, we need to be able to turn stemming off while remaining insensitive to case, and we need to stay reasonably compatible with the previous versions. This means that a term which has a capital first letter but is otherwise lowercase will turn stemming off, but not case sensitivity on.

So we’re left with how to search for such a term in a case-sensitive way, and for this, you’ll have to use global options or the query language.

The modified method is:

If a term is entered with diacritics, diacritic sensitivity is turned on (for this term only).
If the first letter in a term is upper-case and the rest is lower-case, we turn stem expansion off, but we do not become case-sensitive.
If any letter in a term except the first is upper-case, case sensitivity is turned on. Stem expansion is also turned off (even if the first letter is lower-case), because it really makes no sense with case sensitivity.
To search for an all lower-case or capitalized term in a case-sensitive way, use the query language: "Capitalized"C, "lowercase"C
Use the query language and the "D" modifier to turn on diacritics sensitivity.
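
The rules of the modified method can be summed up in a short sketch. It only illustrates the decision logic described in this list and is not the Recoll implementation: the TermOptions type and the function names are hypothetical, diacritics detection is approximated with Unicode decomposition, and turning stem expansion off for diacritic-sensitive terms is our reading of the rule that sensitivity and stem expansion are not combined.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TermOptions:
    case_sensitive: bool
    diac_sensitive: bool
    stem_expand: bool


def has_diacritics(term: str) -> bool:
    """Approximation: true if the decomposed term contains combining marks."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", term))


def options_for(term: str, modifiers: str = "") -> TermOptions:
    """Derive per-term options from the term as typed.

    modifiers holds query language clause modifiers: "C" (case sensitive),
    "D" (diacritics sensitive), "l" (no stem expansion).
    """
    diac = has_diacritics(term) or "D" in modifiers
    first_upper = term[:1].isupper()
    rest_upper = any(ch.isupper() for ch in term[1:])
    # Only an upper-case letter after the first one turns case sensitivity on.
    case = rest_upper or "C" in modifiers
    # A capitalized term, any kind of sensitivity, or "l" turns stemming off.
    stem = not (first_upper or case or diac or "l" in modifiers)
    return TermOptions(case_sensitive=case, diac_sensitive=diac, stem_expand=stem)


if __name__ == "__main__":
    for term, mods in (("resume", ""), ("Resume", ""), ("reSume", ""),
                       ("résumé", ""), ("lowercase", "C")):
        print(repr(term), repr(mods), options_for(term, mods))
```

Note that a capitalized term such as Resume only disables stemming, so the "C" modifier remains the way to ask for a case-sensitive match on it.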

Note that some combinations of choices do not make sense and are not allowed by Recoll: for example, diacritics or case sensitivity do not make sense with stem expansion (which cannot preserve diacritics in any meaningful general way).

The [[ZDevCaseAndDiacritics3.wiki|next page]] describes the actual implementation in Recoll 1.18.