Wednesday 15 March 2017 — This is nearly eight years old. Be careful.
Have you fully understood how Python 2 and Python 3 deal with bytes and Unicode? Have you watched Pragmatic Unicode (also known as the Unicode Sandwich, or unipain) forwards and backwards? You’re a Unicode expert! Nothing surprises you any more.
Until you try IronPython...
Turns out IronPython 2.7.7 has str as unicode!
C:\Users\Ned>"\Program Files\IronPython 2.7\ipy.exe"
IronPython 2.7.7 (2.7.7.0) on .NET 4.0.30319.42000 (32-bit)
Type "help", "copyright", "credits" or "license" for more information.
>>> "abc"
'abc'
>>> type("abc")
<type 'str'>
>>> u"abc"
'abc'
>>> type(u"abc")
<type 'str'>
>>> str is unicode
True
>>> str is bytes
False
String literals work kind of like they do in Python 2: \u escapes are recognized in u"" strings but not in "" strings, yet both produce the same type:
>>> "abc\u1234"
'abc\\u1234'
>>> u"abc\u1234"
u'abc\u1234'
Notice that the repr of this str/unicode type will use a u-prefix if any character is non-ASCII, but if the string is all ASCII, then the prefix is omitted.
OK, so how do we get a true byte string? I guess we could encode a unicode string? WRONG. Encoding a unicode string produces another unicode string, with the encoded byte values as code points:
>>> u"abc\u1234".encode("utf8")
u'abc\xe1\x88\xb4'
>>> type(_)
<type 'str'>
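To make "encoded byte values as code points" concrete, the same result can be reproduced on CPython 3 by decoding the UTF-8 bytes as Latin-1, which maps each byte value to the code point with the same number. This is only an illustration run on CPython, not a description of how IronPython implements encode, and the variable names are made up for the sketch.

```python
# Illustration on CPython 3 (an assumption; this is not IronPython's own code).
text = u"abc\u1234"
utf8_bytes = text.encode("utf8")  # real bytes on CPython: b'abc\xe1\x88\xb4'

# Latin-1 maps byte value N to code point N, so decoding with it yields a
# text string whose characters carry the byte values as code points.
bytes_as_text = utf8_bytes.decode("latin-1")
print(repr(bytes_as_text))

# The mapping is reversible: encoding as Latin-1 gives the original bytes back.
recovered = bytes_as_text.encode("latin-1")
print(recovered == utf8_bytes)
```

The text string printed here has the same code points as the IronPython result shown above: three ASCII letters followed by U+00E1, U+0088, and U+00B4.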
Surely we could at least read the bytes from a file with mode “rb”? WRONG.
>>> type(open("foo.py", "rb").read())
<type 'str'>
>>> type(open("foo.py", "rb").read()) is unicode
True
On top of all this, I couldn’t find docs that explain that this happens. The IronPython docs just say, “Since IronPython is a implementation of Python 2.7, any Python documentation is useful when using IronPython,” and then link to the python.org documentation.
A decade-old article on InfoQ, The IronPython, Unicode, and Fragmentation Debate, discusses this decision, and points out correctly that it’s due to needing to mesh well with the underlying .NET semantics. It seems very odd not to have documented it some place. Getting coverage.py working even minimally on IronPython was an afternoon’s work of discovering each of these oddnesses empirically.
Also, that article quotes Guido van Rossum (from a comment on Calvin Spealman’s blog):
You realize that Jython has exactly the same str==unicode issue, right? I’ve endorsed this approach for both versions from the start. So I don’t know what you are so bent out of shape about.
I guess things have changed with Jython in the intervening ten years, because it doesn’t behave that way now:
$ jython
Jython 2.7.1b3 (default:df42d5d6be04, Feb 3 2016, 03:22:46)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_31
Type "help", "copyright", "credits" or "license" for more information.
>>> 'abc'
'abc'
>>> type(_)
<type 'str'>
>>> str is unicode
False
>>> type("abc")
<type 'str'>
>>> type(u"abc")
<type 'unicode'>
>>> u"abc".encode("ascii")
'abc'
>>> u"abc"
u'abc'
If you want to support IronPython, be prepared to rethink how you deal with bytes and Unicode. I haven’t run the whole coverage.py test suite on IronPython, so I don’t know if other oddities are lurking there.
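One way to avoid being surprised is to ask the running interpreter about its string model rather than assuming it from the version number. The sketch below is a hypothetical helper (describe_string_model is not part of coverage.py or any library) that uses only the identity checks from the sessions above plus the standard platform module. Note that Python 3 also answers "str is the text type, and it is not bytes", so the implementation name is needed to tell it apart from IronPython 2.7.

```python
import platform
import sys


def describe_string_model():
    """Report how this interpreter relates str, unicode text, and bytes."""
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    return {
        "implementation": platform.python_implementation(),
        "major_version": sys.version_info[0],
        # True on Python 3, and on IronPython 2.7 as shown in this post.
        "str_is_text": str is text_type,
        # True on CPython 2 and on the Jython session shown in this post.
        "str_is_bytes": str is bytes,
    }


info = describe_string_model()
for key in sorted(info):
    print("%s: %r" % (key, info[key]))
```

Code that has to run on several implementations could branch on these flags once at import time instead of scattering version checks around.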
Comments
http://www.voidspace.org.uk/ironpython/dark-corners.shtml#id12
How does coverage.py work with IronPython and Jython, which run on top of the CLR and the JVM? Is line tracing implemented specially for alternative runtimes, or does it just work? Does branch coverage with the AST also work?
Coverage.py supports measurement on IronPython and Jython, because they support the sys.settrace function. Reporting requires more code introspection than those platforms support, so you need to run the reporting phase under CPython.
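For readers who haven't used sys.settrace, here is a minimal sketch of the idea: a trace function that records which lines of one function execute. It is a toy written for CPython, not coverage.py's actual tracer, and collect_lines and sample are hypothetical names.

```python
import sys


def collect_lines(func):
    """Run func() and return the set of line numbers executed in its body."""
    executed = set()
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # ignore frames that belong to other code
        if event == "line":
            executed.add(frame.f_lineno)
        return tracer  # keep receiving events for this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(previous)  # restore whatever tracer was installed before
    return executed


def sample():
    total = 0
    for i in range(3):
        total += i
    return total


lines = collect_lines(sample)
first = sample.__code__.co_firstlineno
print(sorted(n - first for n in lines))  # offsets relative to the def line
```

Recording line numbers like this is the measurement half. Turning them into a report needs the source and code-object introspection mentioned above.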
Although IronPython's text handling is strange, we have been able to support it with Robot Framework without too much problem. CI is your friend.
http://stackoverflow.com/questions/39345916/python-returns-string-is-both-str-and-unicode-type
Coverage.py may be able to run IronPython code using pythonnet on top of CPython, as you are doing with Jython. This should probably work even when IronPython calls into .NET DLLs after an `import clr` statement.
http://ironpython.net/documentation/dotnet/dotnet.html#mapping-between-python-builtin-types-and-net-types
The page you linked to doesn't say this: "IronPython departs from Python 2 semantics in these ways: ...". It doesn't even clearly say that str is unicode and that str is not bytes. You might infer that if you know that System.String is a Unicode string.
In fact, that page even says, "Usually, you do not have to think about this. However, you may sometimes have to know about it." This is incorrect. :)
However, there's still one minor difference left: indexing into strings will deliver different results once non-BMP characters are involved, as .NET strings are internally UTF-16, while CPython 3 indexes by code point. I'm not sure whether solving this remaining issue is even worth the effort and performance cost; what do you think?
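The indexing difference can be seen from CPython 3 alone by comparing code points with UTF-16 code units. This sketch assumes CPython 3.3 or later, where len() counts code points; the UTF-16 view is simulated with the utf-16-le codec and struct rather than by running on .NET.

```python
import struct

text = u"a\U0001F600b"  # U+1F600 is outside the Basic Multilingual Plane

# CPython 3 indexes by code point: three items.
code_points = len(text)

# A UTF-16 string stores the non-BMP character as a surrogate pair,
# so the same text occupies four 16-bit code units.
data = text.encode("utf-16-le")
unit_count = len(data) // 2
units = struct.unpack("<%dH" % unit_count, data)

print(code_points, unit_count)
print(hex(ord(text[1])))  # the whole character when indexing by code point
print(hex(units[1]))      # only the high surrogate when indexing by code unit
```

Index 1 names the full character in one model and half of a surrogate pair in the other, which is the mismatch the comment describes.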