Wednesday 1 June 2011
I’m working on projects for Threepress, and they have a good, extensive test suite. I was surprised when a test failed on Ubuntu that had always passed on their Macs.
The test in question was trying to open a file by name. No big deal, right? Well, in this case the filename had an accented character, so it was a big deal. Getting to the bottom of it, I learned some new things about Python and Unicode.
On the disk is a file named lé.txt. On the Mac, this file can be opened by name; on Ubuntu, it cannot. Looking into it, the filename we’re using and the filename the file actually has are different:
>>> import os
>>> fname = u"l\u00e9.txt".encode('utf8')
>>> fname
'l\xc3\xa9.txt'
>>> os.listdir(".")
['le\xcc\x81.txt']
On the Mac, that filename will open that file:
>>> open(fname)
<open file 'lé.txt', mode 'r' at 0x1004250c0>
On Ubuntu, not so much:
>>> open(fname)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'l\xc3\xa9.txt'
What’s with the two different strings that both seem to represent the same text? Wasn’t Unicode supposed to get us out of character set hell by having everyone agree on how to store text? Turns out it doesn’t make everything simple: there are still multiple ways to store one string.
In this case, the accented é is represented as two different UTF-8 strings: both as ‘\xc3\xa9’ and as ‘e\xcc\x81’. In pure Unicode terms, the first is a single code point, U+00E9, or LATIN SMALL LETTER E WITH ACUTE. The second is two code points: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT). Turns out Unicode has both a single combined code point for accented e, and also two code points that together can mean accented é.
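To make the two representations concrete, here is a minimal sketch that lists the code points and UTF-8 bytes of each spelling. It assumes Python 3 (the sessions in this post are Python 2), and the helper name `describe` is made up for illustration.

```python
import unicodedata

composed = "l\u00e9.txt"      # é as a single code point, U+00E9
decomposed = "le\u0301.txt"   # e followed by U+0301 COMBINING ACUTE ACCENT


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


# The two spellings render the same but encode to different bytes.
composed_bytes = composed.encode("utf-8")      # b'l\xc3\xa9.txt'
decomposed_bytes = decomposed.encode("utf-8")  # b'le\xcc\x81.txt'

if __name__ == "__main__":
    for text in (composed, decomposed):
        print(len(text), describe(text), text.encode("utf-8"))
```

The strings differ in length (6 versus 7 code points) and in their encoded bytes, which is all a byte-oriented filesystem gets to compare.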
This demonstrates a complicated pair of Unicode concepts known as equivalence and normalization. Unicode defines complex rules under which our two strings are “canonically equivalent”: different sequences of code points that represent the same text.
On the Mac, trying to open the file with either string works; on Ubuntu, you have to use the same form that is stored on disk. So to open the file reliably, we may have to try several different Unicode normalization forms.
Python provides the unicodedata.normalize function which can perform the normalizations for us:
>>> import unicodedata
>>> fname = u"l\u00e9.txt"
>>> unicodedata.normalize("NFD", fname)
u'le\u0301.txt'
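Building on `unicodedata.normalize`, one way to check whether two names are equivalent is to normalize both to the same form before comparing. This is a small sketch assuming Python 3; the function name `canonically_equal` is hypothetical.

```python
import unicodedata


def canonically_equal(a, b):
    """True if a and b are canonically equivalent Unicode strings."""
    # Normalizing both sides to the same form removes the
    # composed-versus-decomposed difference before comparing.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print("l\u00e9.txt" == "le\u0301.txt")                   # plain comparison
    print(canonically_equal("l\u00e9.txt", "le\u0301.txt"))  # normalized comparison
```

Either NFC or NFD would work here, as long as both sides use the same one.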
Unfortunately, you can’t be sure in what normalization form a filename might be. The Mac likes to create them in decomposed form, but Ubuntu seems to prefer composed form. Seems like a fool-proof file opener would need to try the four different normalization forms (NFD, NFC, NFKD, NFKC) to be sure to open a file with non-ASCII characters in it, but that also seems like a huge pain. Is it really true I have to jump through those hoops to open these files?
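For what it's worth, the "try every form" opener described above could be sketched roughly like this. It assumes Python 3 and a UTF-8 filesystem encoding; `find_existing` and `FORMS` are names invented for this example, not part of any library.

```python
import os
import unicodedata

# The four normalization forms mentioned above.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def find_existing(path):
    """Return the first normalization of path that exists on disk, else None."""
    seen = set()
    for form in FORMS:
        candidate = unicodedata.normalize(form, path)
        if candidate in seen:
            continue  # this form produced a spelling we already tried
        seen.add(candidate)
        if os.path.exists(candidate):
            return candidate
    return None
```

A caller would then do something like `open(find_existing(name) or name)`. Note that this only helps when the name on disk happens to be in one of the four forms; a name that mixes composed and decomposed characters would still be missed.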
Comments
import os, unicodedata
fname = u"l\u00e9.txt"
osfname = dict([(unicodedata.normalize("NFD", unicode(f, "utf-8")), f) for f in os.listdir('.')])[unicodedata.normalize("NFD", fname)]
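The one-liner above is Python 2. A rough Python 3 rendering of the same idea (look the name up among the directory's entries after normalizing both sides to NFD) might look like the following; `resolve_name` is a hypothetical helper, and it returns None instead of raising when nothing matches.

```python
import os
import unicodedata


def resolve_name(wanted, directory="."):
    """Return the on-disk name in directory canonically equivalent to wanted, or None."""
    target = unicodedata.normalize("NFD", wanted)
    for actual in os.listdir(directory):
        # Compare in a common form, but return the name exactly as stored.
        if unicodedata.normalize("NFD", actual) == target:
            return actual
    return None
```

Unlike trying a fixed list of forms, this matches whatever spelling is actually on disk, at the cost of listing the directory.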
Yeah. Most Unix filesystems store filenames as bytes, so the exact encoding *and* normalization form matter. Hell, you're not even guaranteed that two files on the same filesystem use the same encoding; you should really treat Linux FS filenames as opaque byte sequences.
HFS+ (and NTFS) use Unicode filenames instead, so the filesystem handles issues of canonical equivalence on its own.
Side note: HFS+ does not even use UTF-8 filenames for its own encoding; it uses a variant of NFD UTF-16.
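As an illustration of the "opaque byte sequences" view, Python's `os` functions return bytes when given a bytes path. This sketch assumes Python 3; `list_raw_names` is a made-up name.

```python
import os


def list_raw_names(directory="."):
    """List directory entries as raw bytes, without decoding them to text."""
    # os.listdir returns bytes names when its argument is bytes;
    # os.fsencode converts a str path to bytes (and passes bytes through).
    return os.listdir(os.fsencode(directory))


if __name__ == "__main__":
    for raw in list_raw_names("."):
        print(repr(raw))
```

Working with the bytes directly avoids any guess about encoding or normalization, though the names then need explicit decoding before they can be shown to a user.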
So... even if Python solves this with a bit of normalization code in its filesystem API, great, will Ruby, PHP, and Perl solve it the same way? Or do I, as the developer, have to deal with this issue anywhere I use different languages? Seems like a Python-only solution is just passing the problem further along the food chain.
While searching for a solution I stumbled over this excellent article by David A. Wheeler, basically more than you ever wanted to know about the mess:
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
For Python you might also look at the PEP383 at http://www.python.org/dev/peps/pep-0383/, which handles the mess for Python 3.
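For a sense of what PEP 383 provides in Python 3, here is a small sketch of the `surrogateescape` error handler it introduced. Bytes that cannot be decoded are mapped to lone surrogate code points, so the original bytes can be recovered exactly. The sample filename is invented for illustration. Note that this deals with undecodable bytes, not with normalization.

```python
# A filename encoded as Latin-1, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates (0xE9 -> U+DCE9).
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", "surrogateescape")

if __name__ == "__main__":
    print(repr(name), round_tripped == raw)
```

On POSIX systems, `os.fsdecode` and `os.fsencode` apply this handler with the filesystem encoding, which is how Python 3 can hand back `str` filenames even for names that are not valid in that encoding.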
Do they, though? As far as I know, most Unix filesystems make no such claim, and are pretty much all documented as handling filenames at the byte level, with no regard for semantics.
The fact that OSX uses a hack because everyone else is using precomposed form doesn't mean other OSes should do the same.