Filenames with accents

Wednesday 1 June 2011

This is over 13 years old. Be careful.

I’m working on projects for Threepress, and they have a good, extensive test suite. I was surprised when a test failed on Ubuntu that had always passed on their Macs.

The test in question was trying to open a file by name. No big deal, right? Well, in this case, the filename had an accented character, so it was a big deal. Getting to the bottom of it, I learned some new things about Python and Unicode.

On the disk is a file named lé.txt. On the Mac, this file can be opened by name; on Ubuntu, it cannot. Looking into it, the filename we’re using and the filename the file actually has are different:

>>> import os
>>> fname = u"l\u00e9.txt".encode('utf8')
>>> fname
'l\xc3\xa9.txt'
>>> os.listdir(".")
['le\xcc\x81.txt']

On the Mac, that filename will open that file:

>>> open(fname)
<open file 'lé.txt', mode 'r' at 0x1004250c0>

On Ubuntu, not so much:

>>> open(fname)
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'l\xc3\xa9.txt'

What’s with the two different strings that seem to both represent the same text? Wasn’t Unicode supposed to get us out of character set hell by having everyone agree on how to store text? Turns out it doesn’t make everything simple: there are still multiple ways to store one string.

In this case, the accented é is represented as two different UTF-8 strings: both as ‘\xc3\xa9’ and as ‘e\xcc\x81’. In pure Unicode terms, the first is a single code point, U+00E9, or LATIN SMALL LETTER E WITH ACUTE. The second is two code points: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT). Turns out Unicode has both a single combined code point for accented e, and also two code points that together can mean accented é.
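To see the two spellings side by side, here is a small sketch. It is written for Python 3 (the sessions in this post are Python 2), and the variable names are just for illustration:

```python
import unicodedata

composed = "\u00e9"      # one code point
decomposed = "e\u0301"   # two code points: base letter plus combining accent

for label, text in (("composed", composed), ("decomposed", decomposed)):
    # List the official Unicode name of each code point, then the UTF-8 bytes.
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names, text.encode("utf-8"))
```

Both strings render as é, but they have different lengths and different UTF-8 bytes, which is exactly the difference the filesystem sees.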

This demonstrates a complicated Unicode concept known as equivalence and normalization. Unicode defines complex rules that make it so that our two strings are “equivalent”.
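In practice, "equivalent" means the two strings become identical once both are put into the same normalization form. A minimal Python 3 sketch, where same_text is a hypothetical helper and not something from the standard library:

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "l\u00e9.txt"
decomposed = "le\u0301.txt"

print(composed == decomposed)            # raw code point comparison
print(same_text(composed, decomposed))   # comparison after normalization
```

The choice of NFC here is arbitrary; NFD would work just as well, as long as both sides use the same form.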

On the Mac, trying to open the file with either string works; on Ubuntu, you have to use the same form as is stored on disk. So to open the file reliably, we have to try a number of different Unicode normalization forms to be sure to open it.

Python provides the unicodedata.normalize function which can perform the normalizations for us:

>>> import unicodedata
>>> fname = u"l\u00e9.txt"
>>> unicodedata.normalize("NFD", fname)
u'le\u0301.txt'

Unfortunately, you can’t be sure in what normalization form a filename might be. The Mac likes to create them in decomposed form, but Ubuntu seems to prefer composed form. Seems like a fool-proof file opener would need to try the four different normalization forms (NFD, NFC, NFKD, NFKC) to be sure to open a file with non-ASCII characters in it, but that also seems like a huge pain. Is it really true I have to jump through those hoops to open these files?
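For what it's worth, the brute-force opener could look something like this. It's a Python 3 sketch, open_any_form and candidate_names are made-up names, and it is only meant for reading: in a write mode the very first candidate would simply be created. Note also that NFKC and NFKD are compatibility forms, so they can change characters in ways that go beyond accents.

```python
import os
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def candidate_names(name):
    """Return the distinct spellings of name: as given, then each normalization form."""
    seen = []
    for cand in [name] + [unicodedata.normalize(form, name) for form in FORMS]:
        if cand not in seen:
            seen.append(cand)
    return seen

def open_any_form(name, mode="r", directory="."):
    """Try each spelling in turn and return the first file that opens."""
    for cand in candidate_names(name):
        try:
            return open(os.path.join(directory, cand), mode)
        except FileNotFoundError:
            continue
    raise FileNotFoundError(name)
```

This still misses a name that is stored in a mix of forms, so it is a partial answer at best.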

Comments

[gravatar]
Is it possible that a filename could contain multiple accented-e (or other such character that has multiple representations) and each could be normalized a different way? Does that require you to try '4**len(fname)' combinations?
[gravatar]
What about calling os.listdir and then doing a normalized comparison (using any normalization scheme) to see if your file exists, then opening the file using the filename read from the OS?

import os, unicodedata
fname = u"l\u00e9.txt"
names = dict((unicodedata.normalize("NFD", unicode(f, "utf-8")), f) for f in os.listdir('.'))
osfname = names[unicodedata.normalize("NFD", fname)]
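A Python 3 sketch of the same idea follows; match_name and find_on_disk are hypothetical names. Because both sides are normalized before comparing, it also copes with a name stored in a mix of forms, so there is no need to enumerate combinations.

```python
import os
import unicodedata

def match_name(wanted, entries):
    """Return the entry whose NFC form equals the NFC form of wanted, or None."""
    key = unicodedata.normalize("NFC", wanted)
    for entry in entries:
        if unicodedata.normalize("NFC", entry) == key:
            return entry
    return None

def find_on_disk(wanted, directory="."):
    """Look up wanted among the names the OS reports for directory."""
    return match_name(wanted, os.listdir(directory))
```

The returned value is the name exactly as the OS reported it, so passing it to open() uses the on-disk spelling.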
[gravatar]
The real problem is that ext* file systems (at least) don't enforce (nor imply) *any* encoding. A file name can be any sequence of bytes and is not guaranteed to be decodable into unicode representation at all. It makes sense then to work with file names in their exact byte-string representation as much as possible and use "unicodization" only for display in the UI. Thankfully, users usually aren't supposed to enter file names by hand.
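In Python 3 terms, that could be sketched like this: pass a bytes path to os.listdir and it returns bytes names, untouched. The helper names are made up, and decoding as UTF-8 for display is an assumption about what the UI wants.

```python
import os

def list_names_as_bytes(directory="."):
    """List a directory's entries as raw bytes, exactly as stored."""
    return os.listdir(os.fsencode(directory))

def display_name(raw):
    """Decode filename bytes for display only; undecodable bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")

for raw in list_names_as_bytes("."):
    # Keep raw for open(); show only the decoded form to the user.
    print(display_name(raw))
```

The display string is lossy by design, so it should never be fed back into open().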
[gravatar]
> Is it really true I have to jump through those hoops to open these files?

Yeah, because most Unix filesystems store filenames as bytes, so the exact encoding *and* normalization form matter. Hell, you're not even guaranteed that two files on the same filesystem will use the same encoding; you should really treat Linux FS filenames as opaque byte sequences.

HFS+ and NTFS store Unicode filenames instead. HFS+ normalizes them, so it handles issues of canonical equivalence on its own; NTFS, as far as I know, does not normalize.

Side note: HFS+ does not even use UTF-8 for its own filename encoding; it uses UTF-16 in a variant of NFD.
[gravatar]
By the way, Python 2 and 3 on OSX will work if you just provide the unicode file name, no need to muck around with encoding at all.
[gravatar]
The problem isn't that Linux prefers a different normalized form, it's that it doesn't enforce one at the filesystem level. For example, on Mac you can only create a single file, regardless of which unicode form you use:
$ hostinfo
Mach kernel version:
	 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386
(** snip **)

$ echo $'\xc3\xa9.txt' $'e\xcc\x81.txt'
é.txt é.txt

$ echo "foo" > $'\xc3\xa9.txt'

$ echo "bar" > $'e\xcc\x81.txt'

$ ls
e??.txt
But on Linux, you can create multiple files using different forms:
$ cat /proc/version
Linux version 2.6.32.8-grsec-2.1.14-modsign-xeon-64 (root@womb) (gcc version 4.3.2 (Debian 4.3.2-1.1) ) #2 SMP Sat Mar 13 00:42:43 PST 2010

$ echo $'\xc3\xa9.txt' $'e\xcc\x81.txt'
é.txt é.txt

$ echo "foo" > $'\xc3\xa9.txt'

$ echo "bar" > $'e\xcc\x81.txt'

$ ls
e??.txt  ??.txt
The Linux case strikes me as bad design in this day and age. I get that at the byte level these are different names, but if you're going to say you support Unicode, then this shouldn't be permitted. What happens when a user asks to open "é.txt"? Which file do you get??? *ugh*

So... even if Python solves this with a bit of normalization code in its filesystem API, great, will Ruby, PHP, and Perl solve it the same way? Or do I, as the developer, have to deal with this issue anywhere I use different languages? Seems like a Python-only solution is just passing the problem further along the food chain.
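One way to at least detect the ambiguous case from the Linux listing is to group names by their normalized form. A small Python 3 sketch, with normalization_collisions as a hypothetical helper:

```python
import unicodedata
from collections import defaultdict

def normalization_collisions(names):
    """Group names by NFC form and return only the groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}

# The two names from the shell session, plus an unrelated one.
listing = ["\u00e9.txt", "e\u0301.txt", "a.txt"]
print(normalization_collisions(listing))
```

Feeding it the result of os.listdir would flag directories where "é.txt" is ambiguous, though it does nothing to decide which file the user meant.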
[gravatar]
It's just the typical mess with Linux filesystems. I just tried to fix such a problem with Python on various Unices (AIX being one of them) and didn't find a good solution.

While searching for a solution, I stumbled over this excellent article by David A. Wheeler, basically more than you ever wanted to know about the mess:

http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

For Python you might also look at PEP 383 at http://www.python.org/dev/peps/pep-0383/, which handles the mess for Python 3.
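The core of PEP 383 is the surrogateescape error handler, which lets undecodable filename bytes survive a round trip through str. A minimal Python 3 sketch, using a made-up Latin-1 filename as the example:

```python
raw = b"l\xe9.txt"   # Latin-1 bytes for an accented e; not valid UTF-8

# The undecodable byte 0xE9 is smuggled into the string as a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)
```

On POSIX systems, os.fsdecode and os.fsencode apply this handler with the filesystem encoding. It preserves bytes, but it does not normalize anything, so the composed versus decomposed problem remains.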
[gravatar]
> if you're going to say you support Unicode, then this shouldn't be permitted.

Do they, though? As far as I know, most Unix filesystems make no such claim, and are pretty much all documented as handling filenames at the byte level, with no regard for semantics.
[gravatar]
@masklinn: Sorry, wasn't claiming that *nix officially supported Unicode. Just expressing my belief that this has to be addressed at the core, OS API level, not at the language level. If it's going to be solved, it has to be "Linux supports Unicode", not "Python supports Unicode", or "Ruby supports Unicode". That's all. Sorry for the misunderstanding.
[gravatar]
Does this affect non-ext filesystems like tmpfs (normally mounted as /dev/shm ) too? If so it seems a bit mean to single out ext...
[gravatar]
Looks more like you're creating the file with a broken name in the first place...
The fact that OSX uses a hack because everyone else is using precomposed form doesn't mean other OSes should do the same.
[gravatar]
@mmu_man: What use cases do you have for non-normalized file names at the filesystem level?
[gravatar]
@Leo he's just trolling as a pedant because that's the cool thing to do as a "linux man".
[gravatar]
It's worth noting that OS X's HFS filesystem also sees 'foo' and 'Foo' as the same. It's almost unique, as far as I know, in this respect.
[gravatar]
... and that this does not comply with the POSIX specification, which requires case-sensitivity. I'd like to know whether the alternative UFS(?) format provided by OS X behaves in the same way, but I don't have a spare Mac.
[gravatar]
OS X HFS+ can be formatted case-sensitive or case-insensitive on a per-volume basis. Case-insensitive is the default.
[gravatar]
Linux is garbage.
