Parameters affecting how we generate terms and organize the index


Decide if we store character case and diacritics in the index. If we do, searches sensitive to case and diacritics can be performed, but the index will be bigger, and some marginal weirdness may sometimes occur. The default is a stripped index. When using multiple indexes for a search, this parameter must be defined identically for all. Changing the value implies an index reset.


Decide if we store the documents' text content in the index. Storing the text allows extracting snippets from it at query time, instead of building them from index position data.

Newer Xapian index formats have made our use of position lists unacceptably slow in some cases. The last Xapian index format with good performance for the old method is Chert, which is the default in Xapian 1.2, is still supported but no longer the default in 1.4, and will be dropped in 1.6.

The stored document text is translated from its original format to UTF-8 plain text, but not stripped of upper-case, diacritics, or punctuation signs. Storing it increases the index size by 10-20% typically, but also allows for nicer snippets, so it may be worth enabling it even if not strictly needed for performance if you can afford the space.

The variable only has an effect when creating an index, meaning that the xapiandb directory must not exist yet. Its exact effect depends on the Xapian version.

For Xapian 1.4, if the variable was set to 0, we used to use the Chert format and not store the text. If the variable was set to 1, Glass was used and the text was stored. We don't do this any more: storing the text has proved to be the much better option, and dropping this possibility simplifies the code.

So now, the index format for a new index is always the default, but the variable still controls whether the text is stored, and therefore the abstract generation method. With Xapian 1.4 and later, and the variable set to 0, abstract generation may be very slow. This setting may still be useful to save space if you do not use abstract generation at all. To make sure of that, use the appropriate setting in the GUI, and/or avoid the Python API or recollq options which would trigger it.


Decides if terms will be generated for numbers. For example, "123" or "1.5e6" would not be indexed if nonumbers is set ("value123" would still be). Numbers are often quite interesting to search for, and this should probably not be set except in special situations, for example scientific documents with huge amounts of numbers in them, where setting nonumbers will reduce the index size. This can only be set for a whole index, not for a subtree.


Do not store term positions. Term positions allow for phrase and proximity searches, but make the index much bigger. In some special circumstances, you may want to dispense with them.


Determines if we index 'coworker' also when the input is 'co-worker'. This is new in version 1.22, and on by default. Setting the variable to off allows restoring the previous behaviour.


String of UTF-8 punctuation characters to be indexed as words. The resulting terms will then be searchable. For example, by setting the parameter to "%€" (without the double quotes), you would be able to search separately for "100%" or "100€". Note that "100%" and "100 %" would be indexed in the same way: the characters are their own word separators.
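
The splitting rule described above can be illustrated with a small Python sketch. This is not Recoll's actual text splitter: the function name split_terms and the simplified "alphanumeric or not" test are assumptions made for illustration only.

```python
def split_terms(text, indexed_punctuation=""):
    """Split text into terms. Characters listed in indexed_punctuation
    end the current word and are emitted as terms of their own."""
    terms = []
    current = []

    def flush():
        # Emit the word accumulated so far, if any.
        if current:
            terms.append("".join(current))
            current.clear()

    for ch in text:
        if ch in indexed_punctuation:
            flush()
            terms.append(ch)      # the punctuation character is its own term
        elif ch.isalnum():
            current.append(ch)
        else:
            flush()               # any other character is a plain separator
    flush()
    return terms


print(split_terms("100%", "%€"))
print(split_terms("100 %", "%€"))
```

With the parameter set to "%€", both calls produce the same two terms, which is why "100%" and "100 %" match the same searches.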


Process backslash as normal letter. This may make sense for people wanting to index TeX commands as such but is not of much general use.


Process underscore as normal letter. This makes sense in so many cases that one wonders if it should not be the default.


Maximum term length in bytes. Words longer than this will be discarded. The default is 40 and used to be hard-coded, but it can now be adjusted. You may need an index reset if you change the value.
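
Because the limit is expressed in bytes and not in characters, non-ASCII words reach it sooner than their visible length suggests. The following Python sketch shows the idea; the names keep_term and filter_terms are hypothetical, and the assumption that a term of exactly the maximum length is kept follows from the wording "longer than this" above.

```python
def keep_term(term, max_term_length=40):
    """Return True if the UTF-8 encoding of term fits within the limit."""
    return len(term.encode("utf-8")) <= max_term_length


def filter_terms(terms, max_term_length=40):
    """Drop the terms whose UTF-8 encoding exceeds max_term_length bytes."""
    return [t for t in terms if keep_term(t, max_term_length)]


# 30 accented characters need 60 bytes in UTF-8, so this term is discarded
# even though it is shorter than 40 characters.
print(filter_terms(["recoll", "é" * 30]))
```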


Decides if specific East Asian (Chinese, Korean, Japanese) character/word splitting is turned off. This will save a small amount of CPU if you have no CJK documents. If your document base does include such text but you are not interested in searching it, setting nocjk may be a significant time and space saver.


This lets you adjust the size of n-grams used for indexing CJK text. The default value of 2 is probably appropriate in most cases. A value of 3 would allow more precision and efficiency on longer words, but the index will be approximately twice as large.
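
As an illustration of what the n-gram size changes, here is a minimal Python sketch of overlapping n-gram generation. It is an assumption-laden simplification: the function name cjk_ngrams is hypothetical, and the real splitter may emit additional terms (for example shorter n-grams) that are not shown here.

```python
def cjk_ngrams(text, n=2):
    """Return the overlapping n-grams of a run of CJK characters."""
    if len(text) < n:
        # A run shorter than n is kept as a single term.
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(cjk_ngrams("中文搜索", 2))
print(cjk_ngrams("中文搜索", 3))
```

With a size of 2 the four-character run yields three terms, and with a size of 3 it yields two longer, more selective terms.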


External tokenizer for Korean Hangul. This allows using a language-specific processor for extracting terms from Korean text, instead of the generic n-gram term generator. See for instructions.


External tokenizer for Chinese. This allows using the language specific Jieba tokenizer for extracting meaningful terms from Chinese text, instead of the generic n-gram term generator. See for instructions.


Languages for which to create stemming expansion data. Stemmer names can be found by executing 'recollindex -l', or this can also be set from a list in the GUI. The values are full language names, e.g. english, french...


Default character set. This is used for files which do not contain a character set definition (e.g.: text/plain). Values found inside files, e.g. a 'charset' tag in HTML documents, will override it. If this is not set, the default character set is the one defined by the NLS environment ($LC_ALL, $LC_CTYPE, $LANG), or ultimately iso-8859-1 (cp-1252 in fact). If for some reason you want a general default which does not match your LANG and is not 8859-1, use this variable. This can be redefined for any sub-directory.


A list of characters, encoded in UTF-8, which should be handled specially when converting text to unaccented lowercase. For example, in Swedish, the letter a with diaeresis has full alphabet citizenship and should not be turned into an a. Each element in the space-separated list has the special character as its first character, followed by the translation. The handling of both the lowercase and uppercase versions of a character should be specified, as membership in the list turns off both standard accent and case processing for that character. The value is global and affects both indexing and querying. We also convert a few confusing Unicode characters (quotes, hyphen) to their ASCII equivalent to avoid "invisible" search failures.

Examples:

Swedish:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl åå Åå ’' ❜' ʼ' ‐-

German:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl ’' ❜' ʼ' ‐-

French: you probably want to decompose oe and ae, and nobody would type a German ß:
unac_except_trans = ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl ’' ❜' ʼ' ‐-

The default for all, until someone protests, follows. These decompositions are not performed by unac, but it is unlikely that someone would type the composed forms in a search.
unac_except_trans = ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl ’' ❜' ʼ' ‐-
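
To make the element format concrete, the following Python sketch parses such a value and applies it while folding text to unaccented lowercase. It is only an approximation of the behaviour described above: the function names are hypothetical, and the default folding here uses Python's unicodedata module, not the unac library.

```python
import unicodedata


def parse_except_trans(value):
    """Map the first character of each space-separated element
    to the translation that follows it."""
    return {element[0]: element[1:] for element in value.split()}


def fold(text, exceptions):
    """Unaccent and lowercase text, except for the listed characters,
    which are replaced by their translation and otherwise left alone."""
    out = []
    for ch in text:
        if ch in exceptions:
            out.append(exceptions[ch])
            continue
        # Default processing: decompose, drop combining marks, lowercase.
        decomposed = unicodedata.normalize("NFD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(stripped.lower())
    return "".join(out)


exceptions = parse_except_trans("ää Ää öö Öö ßss")
print(fold("Äpple Café", exceptions))
print(fold("Äpple Café", {}))
```

The uppercase entry "Ää" is what lowercases the exempted letter: without it, an uppercase Ä would fall back to default processing and lose its diaeresis.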


Overrides the default character set for email messages which don't specify one. This is mainly useful for readpst (libpst) dumps, which are utf-8 but do not say so.


Set fields on all files (usually of a specific fs area). The syntax is the usual one: name = value ; attr1 = val1 ; [...]. Here the value is empty, so the string needs an initial semi-colon. This is useful, e.g., for setting the rclaptg field for application selection inside mimeview.


Use mtime instead of ctime to test if a file has been modified. The time is used in addition to the size, which is always used. Setting this can reduce re-indexing on systems where extended attributes are used (by some other application), but not indexed, because changing extended attributes only affects ctime.

Notes:
- This may prevent detection of change in some marginal file rename cases (the target would need to have the same size and mtime).
- You should probably also set noxattrfields to 1 in this case, except if you still prefer to perform xattr indexing, for example if the local file update pattern makes it of value (as in general, there is a risk for pure extended attributes updates without file modification to go undetected).

Perform a full index reset after changing this.
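
The difference between the two tests can be sketched in Python as follows. This is an illustration under assumptions, not the indexer's code: the function names and the (size, time) tuple are hypothetical, and the ctime semantics are those of POSIX systems.

```python
import os


def file_signature(path, use_mtime=False):
    """Return (size, timestamp) for path. The timestamp is the ctime by
    default, or the mtime when use_mtime is set."""
    st = os.stat(path)
    stamp = st.st_mtime_ns if use_mtime else st.st_ctime_ns
    return (st.st_size, stamp)


def needs_reindex(path, stored_signature, use_mtime=False):
    """A file is considered modified when its signature has changed."""
    return file_signature(path, use_mtime) != stored_signature
```

A change which only touches the inode, such as setting an extended attribute or changing permissions, updates the ctime but leaves the mtime and the size alone, so the mtime-based signature stays equal and the file is not re-indexed.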


Disable extended attributes conversion to metadata fields. This probably needs to be set if testmodifusemtime is set.


Define commands to gather external metadata, e.g. tmsu tags. There can be several entries, separated by semi-colons, each defining which field name the data goes into and the command to use. Don't forget the initial semi-colon. All the field names must be different. You can use aliases in the "field" file if necessary. As a not too pretty hack conceded to convenience, any field name beginning with "rclmulti" will be taken as an indication that the command returns multiple field values inside a text blob formatted as a recoll configuration file ("fieldname = fieldvalue" lines). The rclmultixx name will be ignored, and field names and values will be parsed from the data. Example: metadatacmds = ; tags = tmsu tags %f; rclmulti1 = cmdOutputsConf %f
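
The entry format and the rclmulti convention can be illustrated with the Python sketch below. It is a simplified reading of the description above: the function names are hypothetical, and details such as quoting, or semi-colons inside commands, are ignored.

```python
def parse_metadatacmds(value):
    """Split the value into (field name, command) pairs.
    The empty element before the initial semi-colon is skipped."""
    entries = []
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, command = part.partition("=")
        entries.append((name.strip(), command.strip()))
    return entries


def fields_from_output(field, output):
    """Turn a command's output into field values. For a field name
    beginning with "rclmulti", the name itself is ignored and the output
    is parsed as "fieldname = fieldvalue" lines."""
    if field.startswith("rclmulti"):
        fields = {}
        for line in output.splitlines():
            key, sep, val = line.partition("=")
            if sep:
                fields[key.strip()] = val.strip()
        return fields
    return {field: output.strip()}


print(parse_metadatacmds("; tags = tmsu tags %f; rclmulti1 = cmdOutputsConf %f"))
```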