IronPython is weird

Wednesday 15 March 2017

This is nearly eight years old. Be careful.

Have you fully understood how Python 2 and Python 3 deal with bytes and Unicode? Have you watched Pragmatic Unicode (also known as the Unicode Sandwich, or unipain) forwards and backwards? You’re a Unicode expert! Nothing surprises you any more.

Until you try IronPython...

Turns out IronPython 2.7.7 has str as unicode!

C:\Users\Ned>"\Program Files\IronPython 2.7\ipy.exe"
IronPython 2.7.7 (2.7.7.0) on .NET 4.0.30319.42000 (32-bit)
Type "help", "copyright", "credits" or "license" for more information.
>>> "abc"
'abc'
>>> type("abc")
<type 'str'>
>>> u"abc"
'abc'
>>> type(u"abc")
<type 'str'>
>>> str is unicode
True
>>> str is bytes
False
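
If you need to probe for this at runtime, a minimal sketch could look like the following. The function name `string_model` is made up for illustration, and I'm assuming `type(u"")` is a reasonable stand-in for "the text type" on every interpreter you care about:

```python
def string_model():
    """Report how this interpreter relates str to its text and bytes types."""
    text_type = type(u"")  # unicode on CPython 2, str on Python 3
    return {
        "str_is_text": str is text_type,
        "str_is_bytes": str is bytes,
    }


model = string_model()
print(model)
```

Going by the session above, IronPython 2.7.7 should report `str_is_text` as true and `str_is_bytes` as false, which is the Python 3 answer rather than the Python 2 one.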

String literals work kind of like they do in Python 2: \u escapes are recognized in u"" strings but not in "" strings, yet both produce the same type:

>>> "abc\u1234"
'abc\\u1234'
>>> u"abc\u1234"
u'abc\u1234'

Notice that the repr of this str/unicode type uses a u-prefix if any character is non-ASCII, but if the string is all ASCII, the prefix is omitted.

OK, so how do we get a true byte string? I guess we could encode a unicode string? WRONG. Encoding a unicode string produces another unicode string with the encoded byte values as code points:

>>> u"abc\u1234".encode("utf8")
u'abc\xe1\x88\xb4'
>>> type(_)
<type 'str'>
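
One possible workaround, sketched here under assumptions: if the encoded result is text whose code points are the byte values, those code points can be packed back into a real bytes object. The helper names `codepoints_to_bytes` and `to_bytes` are hypothetical, and I have only reasoned about the fallback branch, not run it on IronPython:

```python
def codepoints_to_bytes(text):
    """Turn text whose code points are all below 256 into the matching bytes."""
    return bytes(bytearray(ord(ch) for ch in text))


def to_bytes(text, encoding="utf-8"):
    """Encode text, coercing the result to a real bytes object if needed."""
    encoded = text.encode(encoding)
    if isinstance(encoded, bytes):
        # CPython 3, and CPython 2 where bytes is just another name for str.
        return encoded
    # Assumed IronPython 2.7-style result: text carrying byte values
    # as code points, as in the session above.
    return codepoints_to_bytes(encoded)


print(repr(to_bytes(u"abc\u1234")))
```

On CPython the first branch always wins, so the helper costs nothing there. The fallback only matters on an interpreter where `.encode()` hands back text.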

Surely we could at least read the bytes from a file with mode “rb”? WRONG.

>>> type(open("foo.py", "rb").read())
<type 'str'>
>>> type(open("foo.py", "rb").read()) is unicode
True

On top of all this, I couldn’t find any docs explaining this behavior. The IronPython docs just say, “Since IronPython is a implementation of Python 2.7, any Python documentation is useful when using IronPython,” and then link to the python.org documentation.

A decade-old article on InfoQ, The IronPython, Unicode, and Fragmentation Debate, discusses this decision, and points out correctly that it’s due to needing to mesh well with the underlying .NET semantics. It seems very odd not to have documented it some place. Getting coverage.py working even minimally on IronPython was an afternoon’s work of discovering each of these oddnesses empirically.

Also, that article quotes Guido van Rossum (from a comment on Calvin Spealman’s blog):

You realize that Jython has exactly the same str==unicode issue, right? I’ve endorsed this approach for both versions from the start. So I don’t know what you are so bent out of shape about.

I guess things have changed with Jython in the intervening ten years, because it doesn’t behave that way now:

$ jython
Jython 2.7.1b3 (default:df42d5d6be04, Feb 3 2016, 03:22:46)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.8.0_31
Type "help", "copyright", "credits" or "license" for more information.
>>> 'abc'
'abc'
>>> type(_)
<type 'str'>
>>> str is unicode
False
>>> type("abc")
<type 'str'>
>>> type(u"abc")
<type 'unicode'>
>>> u"abc".encode("ascii")
'abc'
>>> u"abc"
u'abc'

If you want to support IronPython, be prepared to rethink how you deal with bytes and Unicode. I haven’t run the whole coverage.py test suite on IronPython, so I don’t know if other oddities are lurking there.

Comments

[gravatar]
This is because IronPython `str` is really CLR type of `System.String`:

http://www.voidspace.org.uk/ironpython/dark-corners.shtml#id12

How does coverage.py work with IronPython and Jython, which run on top of the CLR and the JVM? How is line tracing implemented for alternative runtimes, or does it just work? Does branch coverage with AST also work?
[gravatar]
@Denis, thanks for the link, but somehow the words "str" and "unicode" don't appear on that page. I find it odd how little is said about this.

Coverage.py supports measurement on IronPython and Jython, because they support the sys.settrace function. Reporting requires more code introspection than those platforms support, so you need to run the reporting phase under CPython.
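
To make the sys.settrace point concrete, here is a minimal sketch of line tracing. It is not coverage.py's actual tracer, just the bare mechanism; `trace_lines` and `sample` are names invented for the example, and I have only run this kind of code on CPython:

```python
import sys


def trace_lines(func, *args, **kwargs):
    """Run func and return the set of (filename, lineno) pairs it executed."""
    executed = set()

    def tracer(frame, event, arg):
        if event == "line":
            executed.add((frame.f_code.co_filename, frame.f_lineno))
        return tracer  # keep receiving events for this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # restore whatever tracer was installed before
    return executed


def sample(n):
    total = 0
    for i in range(n):
        total += i
    return total


lines = trace_lines(sample, 3)
print(sorted(lineno for _, lineno in lines))
```

Collecting line numbers like this is the part that only needs sys.settrace. Turning them into a report means knowing which lines were executable in the first place, and that is where the extra introspection comes in.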
[gravatar]
You actually can create bytes on IronPython using `b'xxx'` syntax or `bytes()` built-in. On IronPython `str is unicode` but `bytes is not str`. Kind of like on Python 3.
IronPython 2.7.7 (2.7.7.0) on .NET 4.0.30319.42000 (32-bit)
Type "help", "copyright", "credits" or "license" for more information.
>>> b = b'foo'
>>> type(b)
<type 'bytes'>
>>> b
b'foo'
>>> bytes(u'hyvä', encoding='utf-8')
b'hyv\xc3\xa4'
>>> str is unicode
True
>>> bytes is str
False
>>>
It is strange that `u'hyvä'.encode('utf-8')` yields `str`, not `bytes`, though. Hopefully they get IronPython 3 out soon and fix all this stuff.

Although IronPython's text handling is strange, we have been able to support it with Robot Framework without too much problem. CI is your friend.
[gravatar]
@Pekka: I was able to use bytes() to get past these issues. But tox doesn't seem able to run under IronPython. I'm claiming coverage.py works, and waiting for bug reports to prove me wrong :) If you know ways to run tox and pytest under IronPython, I'd be interested.
[gravatar]
Unfortunately I cannot help with that. The main reason we use the plain-old-unittest, not pytest, is to avoid fighting with too many external dependencies when we support Python, Jython, IronPython and PyPy.
[gravatar]
@Ned, yes the article skips unicode vs str altogether because underlying datatype System.String is unicode:

http://stackoverflow.com/questions/39345916/python-returns-string-is-both-str-and-unicode-type

Coverage.py may be able to run IronPython code using pythonnet on top of CPython, like you are doing it with Jython. This should probably work even when IronPython calls into .NET dlls after `import clr` statement.
[gravatar]
@Denis: "because underlying datatype System.String is unicode" is the reason that IronPython behaves this way. But it's not a reason to leave the fact out of the documentation. How was I supposed to discover this major departure from the semantics of Python 2?
[gravatar]
It's clear to me what the root of this problem is. The IronPython developers are thinking about a .NET developer who is interested in writing some Python code. I have a different problem: I have some Python code that needs to run on IronPython.

The page you linked to doesn't say this: "IronPython departs from Python 2 semantics in these ways: ...". It doesn't even clearly say that str is unicode and that str is not bytes. You might infer that if you know that System.String is a Unicode string.

In fact, that page even says, "Usually, you do not have to think about this. However, you may sometimes have to know about it." This is incorrect. :)
[gravatar]
Actually, this is most likely to change when IronPython upgrades to Python 3 (which is in the works, contributions welcome!), as Python 3 string semantics are much closer to IronPython / .NET string semantics.

However, there's still one minor difference left: Indexing into strings will deliver different results once non-BMP characters are involved, as .NET Strings are internally UTF-16, while CPython 3 uses UTF-32. I'm not sure whether solving this remaining issue is even worth the effort and performance cost, what do you think?
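
A small sketch of that difference, run on CPython 3: counting UTF-16 code units (which is what .NET's `String.Length` counts) gives a different answer from counting code points as soon as a non-BMP character appears. The helper name `utf16_code_units` is made up for the example:

```python
def utf16_code_units(text):
    """Number of UTF-16 code units needed to store text."""
    # Each UTF-16 code unit is two bytes; the "-le" codec adds no BOM.
    return len(text.encode("utf-16-le")) // 2


bmp_only = u"abc\u1234"
with_astral = u"a\U0001F600b"  # U+1F600 needs a surrogate pair in UTF-16

print(len(bmp_only), utf16_code_units(bmp_only))
print(len(with_astral), utf16_code_units(with_astral))
```

For the BMP-only string the two counts agree. For the second string, CPython 3 sees three characters while a UTF-16 view sees four code units, so an index past the emoji points at different things on the two runtimes.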
