Unix and non-ASCII file names, a summary of issues

Unix/Linux file and directory names are binary C strings, that is, plain byte sequences. Only the null byte and the slash character (/) are forbidden inside a name; nowhere does the kernel interpret the strings as meaningful or printable.
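
As a minimal sketch of this rule in Python, the following checks a name the way the kernel does, with no notion of character encoding, and lists a directory as raw bytes. The helper name is_valid_component is hypothetical, and the check ignores the length limits that file systems impose.

```python
import os

def is_valid_component(name: bytes) -> bool:
    """Return True if `name` is acceptable as a single file name component.

    Hypothetical helper: it only applies the two byte-level rules described
    above (no NUL byte, no slash) and rejects the empty name. File system
    length limits are not checked.
    """
    return len(name) > 0 and b"\0" not in name and b"/" not in name

# Passing a bytes path makes os.listdir() return the names as raw bytes,
# exactly as stored, with no decoding attempted.
raw_names = os.listdir(b".")

print(is_valid_component(b"r\xe9sum\xe9.txt"))  # ISO-8859-1 bytes, no NUL or slash
print(is_valid_component(b"a/b"))               # contains a slash
```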

In the old days, all utilities that displayed anything to the user were ASCII-based, and people used pure printable ASCII file names (even space characters inside names were a cause for trouble). Non-alphanumeric characters were exclusively used for playing tricks on colleagues. And all was well.

Then the devil came in the guise of accented 8-bit characters. The system has no problem with them, since file names are still binary C strings, but the utilities have to display them or take them as input. Because no encoding specification is stored with the file names, they can only do this according to the character encoding taken from the user’s current locale.

For example, fr_FR.UTF-8 and fr_FR.ISO8859-1 could be used simultaneously on the same system (by different users), but they are completely incompatible. ISO-8859-1 strings containing accented characters are generally invalid when viewed in a UTF-8 locale (they will display as question marks or some other conventional error marker). UTF-8 strings will display as gibberish in an ISO-8859-1 locale.

This means that file names created by a UTF-8 user are displayed as garbage to the ISO-8859-1 user, and vice versa.

If you ever change your locale, your old files are still there with the same names (in the binary sense), but the names display badly and you will have great trouble typing them. Add distributed (NFS) file systems, and things become totally unmanageable. Also think about archives sent from another system with a different encoding.

As far as Recoll is concerned:

  • The file names inside recoll.conf are not transcoded. They are taken as binary strings (mostly: only \n and space are a bit special) and passed as is to the system. So if you edit recoll.conf with a text editor, in the same locale that is or has been used for the file names, you’ll be fine.

  • Up to version 1.12, the GUI configuration tool had a bug: it should have transcoded between the internal Qt format and locale-dependent strings, but it either did not or did it badly.

  • There is also an exception for the unac_except_trans variable: its value has to be UTF-8, so if the rest of the file uses another encoding, you’ll need to edit two separate files and concatenate them.
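
For the last point, the two parts have to be joined as raw bytes, so that neither is transcoded on the way. A minimal sketch in Python follows. The function name and the file names recoll.conf.main and recoll.conf.unac are placeholders chosen for illustration, not names that Recoll knows about.

```python
def concatenate_binary(part_paths, dest_path):
    """Join files byte for byte, without decoding any of them."""
    with open(dest_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as src:
                data = src.read()
            out.write(data)
            # Keep the parts on separate lines if one lacks a final newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")

# Hypothetical usage: one part edited in the locale used for file names,
# the other (holding unac_except_trans) edited as UTF-8.
# concatenate_binary(["recoll.conf.main", "recoll.conf.unac"], "recoll.conf")
```

Opening both sides in binary mode is the point of the sketch: a text-mode copy would decode and re-encode the data according to the current locale.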

As of version 1.13, Recoll uses local8Bit()/fromLocal8Bit() to convert recoll.conf file names from/to QStrings (it uses UTF-8 for all string values which are not file names).

The Qt file dialog is broken (or at least was; I have not checked this on recent versions). It should treat file paths as almost-binary data, not QStrings, but it doesn’t. As a consequence, things are even more broken than necessary as seen from there:

With LANG="C", non-ASCII paths can’t be used at all:

  • Strings read from recoll.conf are stripped of 8-bit characters before display.

  • Directory entries with 8-bit characters are not displayed at all in the selection dialog.

With LANG="fr_FR.UTF-8", only UTF-8 paths can be used:

  • Strings read from recoll.conf are damaged when converted to QString (except those that were actually UTF-8).

  • Only the UTF-8 directory entries are displayed in the selection dialog.

With LANG="fr_FR.iso8859-1", everything works:

  • Strings read from recoll.conf are displayed with weird characters if they use another encoding such as UTF-8, but are correctly maintained and can be read back from the dialogs and rewritten without damage.

  • Directory entries with 8-bit characters are displayed weirdly (as expected), but can be manipulated without trouble (this includes UTF-8 names, of course).

In conclusion, only the ISO-8859 locales can be used for handling mixed-encoding situations. This is a possible workaround for people who need it.
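
One way to see why this works: a single-byte encoding such as ISO-8859-1 assigns a character to every byte value, so decoding and re-encoding gives back the original bytes, whatever encoding they were really written in. The Python sketch below illustrates the principle. It uses Python's own codecs as a stand-in for the Qt conversions, which is an assumption made for illustration and not the code Recoll runs.

```python
def survives_round_trip(raw: bytes, encoding: str) -> bool:
    """True if decoding then re-encoding `raw` returns the same bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # Strict decoding refuses byte sequences invalid in `encoding`.
        return False

utf8_name = "été".encode("utf-8")
latin1_name = "été".encode("iso-8859-1")

for raw in (utf8_name, latin1_name):
    print(raw,
          survives_round_trip(raw, "iso-8859-1"),
          survives_round_trip(raw, "utf-8"))
```

Both names survive the ISO-8859-1 round trip, while the ISO-8859-1 name does not survive a strict UTF-8 one, which matches the behaviour of the dialogs described above.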