Character case and diacritic marks (3), implementation

In previous pages, we discussed diacritics and stemming, and an appropriate interface for switchable search sensitivity to diacritics and character case.

So you are in this mood again and you don’t want to type accents (maybe you’re stuck with a QWERTY American English keyboard), or conversely you want to resume looking for your résumé, and you’ve told Recoll as much, using the appropriate interface. What happens then?

The second case is easy if the index is raw, and mostly impossible if it is stripped. So we’ll concentrate on the first case: how to achieve case and diacritics insensitivity on a raw index?

Recoll uses three expansion tables:

  • The first table has stripped and lowercased terms as keys and raw terms as data: mate -> (mate, maté, MATE,...).

  • The second table has lowercased stems as keys and original lowercase terms as data (when using multiple languages, there are several such tables): évit -> (éviter, évite, évitâmes, ...).

  • The third table has stripped and lowercased stems as keys and stripped lowercased terms as data: evit -> (eviter, evite, evitons) and evitam -> (evitames, ...)
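To make the shape of these tables concrete, here is a minimal Python sketch, not Recoll's actual code. The names `strip_diacritics`, `toy_stem` and `build_tables` are hypothetical, and `toy_stem` is a deliberately crude stand-in for a real stemmer. The sketch also assumes, as the evitam example suggests, that the third table is keyed by the stem of the already stripped term.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(term):
    """Drop combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: strips one known suffix."""
    for suffix in ("âmes", "ons", "es", "er", "e"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_tables(index_terms):
    case_diac = defaultdict(set)       # stripped+lowercased term -> raw terms
    stems_raw = defaultdict(set)       # lowercased stem -> lowercased terms
    stems_stripped = defaultdict(set)  # stripped stem -> stripped lowercased terms
    for term in index_terms:
        lowered = term.lower()
        folded = strip_diacritics(lowered)
        case_diac[folded].add(term)
        stems_raw[toy_stem(lowered)].add(lowered)
        # Assumption: the stem is computed on the stripped term.
        stems_stripped[toy_stem(folded)].add(folded)
    return case_diac, stems_raw, stems_stripped

index_terms = ["mate", "maté", "MATE", "éviter", "Évite", "évitâmes", "évitons"]
case_diac, stems_raw, stems_stripped = build_tables(index_terms)
print(sorted(case_diac["mate"]))
print(sorted(stems_raw["évit"]))
print(sorted(stems_stripped["evit"]))
print(sorted(stems_stripped["evitam"]))
```

With this toy stemmer, évitâmes lands under évit in the second table, while its stripped form evitames lands under evitam in the third one, because the stripped suffix is no longer recognized as a verb ending.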

The first table can be used for full case and diacritics expansion, or for only one of those, by post-filtering the results of the full expansion. For example, if we only want diacritics expansion, we strip diacritics from each result term and keep it only if the result is identical to the input with its diacritics stripped. Likewise, if we have mate -> (mate, maté, MATE, MATÉ) in the table and want to only perform case expansion for an input of maté, we apply case folding to each term of the full expansion and keep maté and MATÉ, which fold to maté; mate and MATE are dropped, as they fold to mate, which differs from the input.
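The post-filtering step can be sketched in Python as follows. This is an illustration under assumptions, not Recoll's code: `filtered_expansion` and its parameters are hypothetical names, and the table is a plain dictionary written by hand.

```python
import unicodedata

def strip_diacritics(term):
    """Drop combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def filtered_expansion(term, table, expand_case, expand_diacritics):
    """Full expansion through the first table, then post-filtering.

    A candidate is kept when it matches the input once the properties
    being expanded (case and/or diacritics) have been erased from both.
    """
    def signature(t):
        if expand_case:
            t = t.lower()
        if expand_diacritics:
            t = strip_diacritics(t)
        return t

    # The table key is always the stripped and lowercased input.
    candidates = table.get(strip_diacritics(term).lower(), set())
    wanted = signature(term)
    return {c for c in candidates if signature(c) == wanted}

# Hand-written first table, restricted to terms "present in the index".
table = {"mate": {"mate", "maté", "MATE"}}

full = filtered_expansion("mate", table, True, True)
case_only = filtered_expansion("maté", table, True, False)
diac_only = filtered_expansion("mate", table, False, True)
print(sorted(full), sorted(case_only), sorted(diac_only))
```

With both flags on, the signature is the table key itself and nothing is filtered out; with both off, only an exact match of the input survives.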

We only perform stemming expansion when case and diacritics sensitivity is off. It uses the second and third tables, applied respectively to the lowercased and to the lowercased/stripped output of the first step, and each term in the stemming output is then expanded again for case and diacritics (using the first table).
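The three steps chain together roughly as in the Python sketch below. It is only an illustration: `fold`, `toy_stem`, `build_tables` and `insensitive_expansion` are hypothetical names, the stemmer is a crude suffix stripper standing in for a real one, and the index is a handful of made-up terms.

```python
import unicodedata
from collections import defaultdict

def fold(term):
    """Strip diacritics and lowercase (key form of the first table)."""
    nfd = unicodedata.normalize("NFD", term)
    return "".join(c for c in nfd if not unicodedata.combining(c)).lower()

def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: strips one known suffix."""
    for suffix in ("és", "es", "er", "é", "e"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_tables(index_terms):
    case_diac, stems_raw, stems_stripped = (defaultdict(set) for _ in range(3))
    for term in index_terms:
        case_diac[fold(term)].add(term)
        stems_raw[toy_stem(term.lower())].add(term.lower())
        stems_stripped[toy_stem(fold(term))].add(fold(term))
    return case_diac, stems_raw, stems_stripped

def insensitive_expansion(term, case_diac, stems_raw, stems_stripped):
    # Step 1: case and diacritics expansion of the input (first table).
    variants = set(case_diac.get(fold(term), set()))
    # Step 2: stem expansion of the lowercased variants (second table)
    # and of their stripped forms (third table).
    stem_output = set()
    for lowered in {v.lower() for v in variants}:
        stem_output |= stems_raw.get(toy_stem(lowered), set())
        stem_output |= stems_stripped.get(toy_stem(fold(lowered)), set())
    # Step 3: expand every stemming result again through the first table.
    final = set(variants)
    for candidate in stem_output:
        final |= case_diac.get(fold(candidate), set())
    return final

index_terms = ["resume", "RESUME", "Résumé", "résumer", "résumés", "mate"]
tables = build_tables(index_terms)
result = insensitive_expansion("resume", *tables)
print(sorted(result))
```

Every lookup goes through a table built from the index, so the result can only contain terms that actually occur there; the unrelated term mate is never reached.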

A full example of the expansion occurring during an insensitive search for resume using French stemming on a mixed English/French index follows. An important thing to remember is that the result of each expansion is a function of the terms actually present in the index, not some arbitrary computation (and so, of course, many of the possible but absent variations are missing).

The case and diacritics expansion of resume yields +RESUME Resume Résumé resumé résume résumé resume+.

The Stem expansion input list (lower-cased) is:

+resume resumé résume résumé+, and the output is:
+resum resume resumenes resumer resumes resumé resumée résum résumait
résumant résume résumer résumerai résumerait résumes résumez résumé résumée
résumées résumés+

Each of the above terms is then fed to case and diacritics expansion (first table), for the final output:
+resume résumé Résumé résumer résume Resume résumés RESUME resumes
resumer résumant resúmenes resumé résumait résumes résumée resumee
résumerait Résumez résumerai RÉSUMÉES Resumée Resumes résumées+.

A Xapian OR query is finally constructed from the expanded term list.