Saturday 9 April 2005 — This is almost 20 years old. Be careful.
Damien got a phishing email that looked like total gibberish, which he described as the Worst Phisher Ever. Turns out it wasn’t a moronic spammer, but someone clever enough to use obscure Unicode features to sneak past spam filters.
The scrambled text looked like this in his Firefox browser:
Dera Bcralays Merebm,
Tsih eamil was setn by the Balcrays svreer to virefy yruo eamil adsserd. You mtsu comelpte tsih procses by cilcking on the lkni bwole and entegnir in the samll wiwodn yruo Balcrays Mbmeership nebmur, paedocss and mbaromele wrod.
But he discovered that when viewed with IE, the text was perfectly readable:
Dear Barclays Member,
This email was sent by the Barclays server to verify your email address. You must complete this process by clicking on the link below and entering in the small window your Barclays Membership number, passcode and memorable word.
Here is the actual text. Try viewing it in Firefox (scrambled) and IE (readable):
Dera Bcralays Merebm,
Tsih eamil was setn by the Balcrays svreer to virefy yruo eamil adsserd. You mtsu comelpte tsih procses by cilcking on the lkni bwole and entegnir in the samll wiwodn yruo Balcrays Mbmeership nebmur, paedocss and mbaromele wrod.
What’s going on? Confusing matters even more, if you view source in Firefox, you see scrambled text, and in IE you see readable text. How can the same series of bytes look different in the source?
Reading the page directly with readurl -x, I saw this:
000fa0: 65 22 3e 0a 0a 3c 70 3e 20 20 20 20 44 65 e2 80 e">..<p> De..
000fb0: ae 72 61 e2 80 ac 20 42 e2 80 ae 63 72 61 e2 80 .ra... B...cra..
000fc0: ac 6c 61 79 73 20 4d 65 e2 80 ae 72 65 62 6d e2 .lays Me...rebm.
000fd0: 80 ac 2c 3c 62 72 3e 3c 62 72 3e 20 20 20 20 3c ..,<br><br> <
Between the “De” and “ra” are bytes “e2 80 ae”, and after the “ra” are bytes “e2 80 ac”. This smells like UTF-8. An interactive Python prompt and the decode() function reveal the Unicode code points:
>>> for c in b'De\xE2\x80\xAEra\xE2\x80\xAC'.decode('utf-8'):
...     print(hex(ord(c)))
...
0x44
0x65
0x202e
0x72
0x61
0x202c
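A small sketch of the same lookup as a script, assuming Python 3 and the standard unicodedata module, which also prints each character's official name. The helper name describe is made up for this illustration.

```python
import unicodedata

# Bytes copied from the hex dump: "De", e2 80 ae, "ra", e2 80 ac.
raw = b"De\xe2\x80\xaera\xe2\x80\xac"

def describe(data):
    """Hypothetical helper: decode UTF-8 bytes and name every code point."""
    rows = []
    for ch in data.decode("utf-8"):
        # The fallback covers code points that have no name in the database.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append(("U+%04X" % ord(ch), name))
    return rows

for code, name in describe(raw):
    print(code, name)
```

The third and sixth rows are the two invisible characters; the other four are ordinary Latin letters.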
So Unicode U+202E and U+202C are behind the mischief. They are Right-To-Left Override and Pop Directional Formatting respectively, and they control the rendering of bidirectional text. What’s going on here is that the “D” and “e” are written left-to-right, as is usual for English, then the writing direction is switched to right-to-left, “r” and “a” are written, and the writing direction is restored to left-to-right. The result, in a renderer that properly handles these codes, is “Dear”. The result, in a renderer that ignores Unicode characters it doesn’t understand, is “Dera”. Unicode Standard Annex #9: The Bidirectional Algorithm has all the details.
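To make the two outcomes concrete, here is a rough sketch that imitates both kinds of renderer. It is not the real bidirectional algorithm from Annex #9: it assumes only un-nested override/pop pairs around plain Latin letters, and it ignores bracket mirroring. The function names, including scramble, are invented for this illustration; nothing here is taken from the spammer's actual tooling.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def render_ignoring_bidi(text):
    """Imitate a renderer that skips the controls and shows stored order."""
    return text.replace(RLO, "").replace(PDF, "")

def render_with_override(text):
    """Imitate a bidi-aware renderer, very roughly: reverse each RLO...PDF run.

    Simplification (an assumption, not UAX #9): no nesting, no mirroring.
    """
    out = []
    run = None  # None outside an override, otherwise the characters collected so far
    for ch in text:
        if ch == RLO and run is None:
            run = []
        elif ch == PDF and run is not None:
            out.extend(reversed(run))
            run = None
        elif run is not None:
            run.append(ch)
        else:
            out.append(ch)
    if run is not None:
        # An override that is never popped is treated as running to the end.
        out.extend(reversed(run))
    return "".join(out)

def scramble(word, start, end):
    """Hypothetical helper: store word[start:end] reversed inside RLO...PDF."""
    return word[:start] + RLO + word[start:end][::-1] + PDF + word[end:]

stored = "De" + RLO + "ra" + PDF
print(render_ignoring_bidi(stored))   # Dera
print(render_with_override(stored))   # Dear
```

Calling scramble("Barclays", 1, 4) produces the same sequence seen in the hex dump for the second word: “B”, the override, “cra”, the pop, then “lays”.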
Some spammers are very clever.
Comments
I wonder if there's a legitimate use for this technique.
-rich
Also worth checking whether this exploit could be used in URLs themselves... I don't think it's possible, but am still planning to raise my paranoia level up yet another notch and check source on everything.
Bracketing characters are reversed when text-direction is switched, so reversing a[b]c results in c[b]a, rather than c]b[a. Neat!
Just as an FYI, the text is scrambled by all browsers at my disposal in Mac OS X (10.3.8) - Firefox, Camino, Safari, Mozilla, Netscape, and Omni. Microsoft ceased production of IE for Mac a number of years ago when Apple began development of Safari.