Term synonyms and text search: in general, there are two main ways to use term synonyms for searching text:
At index creation time, they can be used to alter the indexed terms, either increasing or decreasing their number, by expanding the original terms to all synonyms, or by reducing all synonym terms to a canonical one.
At query time, they can be used to match texts containing terms which are synonyms of the ones specified by the user, either by expanding the query for all synonyms, or by reducing the user entry to canonical terms (the latter only works if the corresponding processing has been performed while creating the index).
With one exception, Recoll only uses synonyms at query time. A user query term which is part of a synonym group will optionally be expanded into an OR query for all terms in the group.
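The expansion described above can be illustrated with a minimal Python sketch. This is not Recoll's actual implementation: the function name `expand_term`, the in-memory list of groups and the parenthesized output format are all assumptions made for the illustration.

```python
def expand_term(term, groups):
    """Return an OR expression covering every synonym of `term`.

    `groups` is a list of synonym groups, each a list of casefolded terms.
    Hypothetical helper, for illustration only.
    """
    folded = term.casefold()
    for group in groups:
        if folded in group:
            # Quote multi-word synonyms so that they read as phrases.
            alternatives = [f'"{s}"' if " " in s else s for s in group]
            return "(" + " OR ".join(alternatives) + ")"
    # The term belongs to no group: leave it unchanged.
    return term


groups = [
    ["hi", "hello", "good morning"],
    ["bye", "goodbye", "see you", "au revoir"],
]
print(expand_term("hello", groups))   # (hi OR hello OR "good morning")
print(expand_term("recoll", groups))  # recoll
```

A term which is not in any group passes through untouched, which matches the optional nature of the expansion.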
The one exception is that if the idxsynonyms parameter is set during indexing, and if the file contains multi-word synonyms, a multi-word single term will be emitted for every occurrence found in the text. If the same file is in use at query time, this will allow phrase and proximity searches to work for the multi-word synonyms.
Synonym groups are defined inside ordinary text files. Each line in the file defines a group.
Example:
    hi hello "good morning"

    # not sure about "au revoir" though. Is this english ?
    bye goodbye "see you" \
        "au revoir"
As usual, lines beginning with a # are comments, empty lines are ignored, and lines can be continued by ending them with a backslash.
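To make the file format concrete, here is a rough Python sketch of a reader for it. It only follows the rules stated above (one group per line, # comments, empty lines ignored, backslash continuation, double quotes around multi-word synonyms). The function name `parse_synonyms` is hypothetical, and Recoll's own parser may treat edge cases differently.

```python
import re

# A token is either a double-quoted multi-word synonym or a bare word.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_synonyms(text):
    """Parse synonyms-file text into a list of groups (lists of terms)."""
    groups = []
    pending = ""  # accumulates a line continued with a backslash
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):
            # Continuation: drop the backslash and wait for the next line.
            pending = line[:-1] + " "
            continue
        if not line or line.startswith("#"):
            continue  # empty line or comment
        groups.append([quoted or bare for quoted, bare in TOKEN.findall(line)])
    return groups


sample = r'''hi hello "good morning"

# not sure about "au revoir" though. Is this english ?
bye goodbye "see you" \
    "au revoir"
'''
for group in parse_synonyms(sample):
    print(group)
```

With the sample text, this yields two groups: the first with three entries, the second with four, the continued line having been merged into the second group.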
Multi-word synonyms are supported, but be aware that these will generate phrase queries, which may degrade performance and will disable stemming expansion for the phrase terms.
The contents of the synonyms file must be casefolded (not only lowercased), because this is what is expected at the point in the query processing where it is used. There are a few cases where this makes a difference: for example, German sharp s should be expressed as ss, and Greek final sigma as a regular sigma. For reference, Python 3 has an easy way to casefold words (str.casefold()).
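As a short illustration of the difference between lowercasing and casefolding, the following Python snippet uses the standard str.lower() and str.casefold() methods. The helper `casefold_lines` is a hypothetical convenience for preparing the lines of a synonyms file and is not part of Recoll.

```python
def casefold_lines(lines):
    """Return the given synonyms-file lines with every character casefolded."""
    return [line.casefold() for line in lines]


print("Straße".lower())     # lowercasing keeps the sharp s
print("Straße".casefold())  # casefolding expresses it as "ss"
print("ς".casefold())       # Greek final sigma becomes a regular sigma

print(casefold_lines(['Hi Hello "Good Morning"', "Straße Weg"]))
```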
The synonyms file can be specified in the Search parameters tab of the GUI configuration Preferences menu entry, or as an option for command-line searches.
Once the file is defined, the use of synonyms can be enabled or disabled directly from the Preferences menu.
The synonyms are searched for matches with user terms after the latter are stem-expanded, but the contents of the synonyms file itself are not subjected to stem expansion. This means that a match will not be found if the form present in the synonyms file is not present anywhere in the document set (same with accents when using a raw index).
The synonyms function is probably not going to help you find your letters to Mr. Smith. It is best used for domain-specific searches. For example, it was initially suggested by a user performing searches among historical documents: the synonyms file would contain nicknames and aliases for each of the persons of interest.