Hunting a random() bug

Thursday 14 February 2013This is 12 years old. Be careful.

At edX, we have Python behind the scenes in courses to initialize the state of problems presented to students. Often, these problems are randomized so that different students will see different details in quantitative problems, but each student’s random seed is saved so that the student will see the same problem if they revisit the page.

The seed is used to seed the random module before executing any chunk of course Python, so that you can simply use the random module and know that you’ll get an appropriate value.

Today I found code like this in a course:

import q
random.seed(the_seed)
# ... generate the problem ...

My task was to refactor how information flowed around, and the_seed wasn’t going to be available, so I asked why the code was like this. It seemed odd, because the random module had just been seeded before this code was invoked, so why had the author bothered to re-seed the module with the same seed?

The answer was that it was a mysterious bug from months ago where the first time the code was run, it would produce a different result than any other time, and the re-seeding solved it. The q import seemed to be messing with the random seed, but only the first time.

The “only first time” clue pointed to it being code that is run on import. Remember, Python modules are just a series of statements, and when you import a module, it really executes all the statements. There’s no “import mode” that just collects function definitions. If you write a statement with a side effect at the top level of a module, that side effect will happen when you import the module.

But statements in module are only executed the first time the module is imported in a process. Subsequent imports simply produce another reference to the existing module object. Everything pointed to a statement running during import which stomped on the random module.

The q module imported a number of other modules, including numpy and sympy. But why would importing a module re-seed the random module?

A little experimenting showed that sympy was at fault here:

Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> # Some baseline data
>>> import random
>>> random.seed(17)
>>> random.random()
0.5219839097124932
>>> random.random()
0.8066907771186791
>>> random.random()
0.9604947743238768
>>> random.random()
0.2896253777644655
>>> 
>>> # Re-seed, and import sympy
>>> random.seed(17)
>>> import sympy
>>> random.random()
0.8066907771186791
>>>

Looking at the values, after importing sympy, we’ve skipped ahead one number in our random sequence. So sympy isn’t re-seeding the generator, it’s consuming a random number.

To find out where, we resorted to a monkey-patching trick: Replace random.random with a booby-trap:

Python 2.7.3 (default, Aug  1 2012, 05:16:07) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> random.random = lambda: 1/0
>>> import sympy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv/lib/python2.7/site-packages/sympy/__init__.py", line 20, in <module>
    from sympy.core import *
  File "/venv/lib/python2.7/site-packages/sympy/core/__init__.py", line 8, in <module>
    from expr import Expr, AtomicExpr
  File "/venv/lib/python2.7/site-packages/sympy/core/expr.py", line 2020, in <module>
    from mul import Mul
  File "/venv/lib/python2.7/site-packages/sympy/core/mul.py", line 1197, in <module>
    from numbers import Rational, igcd
  File "/venv/lib/python2.7/site-packages/sympy/core/numbers.py", line 1993, in <module>
    from function import FunctionClass
  File "/venv/lib/python2.7/site-packages/sympy/core/function.py", line 43, in <module>
    from sympy.utilities import default_sort_key
  File "/venv/lib/python2.7/site-packages/sympy/utilities/__init__.py", line 13, in <module>
    from runtests import test, doctest
  File "/venv/lib/python2.7/site-packages/sympy/utilities/runtests.py", line 472, in <module>
    class SymPyTests(object):
  File "/venv/lib/python2.7/site-packages/sympy/utilities/runtests.py", line 475, in SymPyTests
    seed=random.random()):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionErrorinteger division or modulo by zero

OK, not sure why it’s importing its tests when I try to use the package, but looking at the code, here’s the culprit:

class SymPyTests(object):
    def __init__(self, ..., seed=random.random()):
        #...
        self._seed = seed

Here we can see the problem. Remember that function arguments are computed once, when the function is defined. Since this function is defined when the module is imported, random.random() will be called during import, consuming one of our random numbers.

Better would be to define it like this:

class SymPyTests(object):
    def __init__(self, ..., seed=None):
        #...
        self._seed = seed
        if self._seed is None:
            self._seed = random.random()

I’m not quite sure which behavior the author wanted, one seed for all the instances, or one seed per instance. I know I don’t want importing this module to change my random number sequence.

Amusingly enough, the behavior of the initializer is irrelevant, it’s only called in one place, and never defaults the seed argument:

def test(*paths, **kwargs):
    ...
    seed = kwargs.get("seed", None)
    if seed is None:
        seed = random.randrange(100000000)
    t = SymPyTests(..., seed)

The best solution for our code would be to not rely on the module-level random number sequence, and instead use our own Random object. Come to think of it, that’s what sympy should do too.

BTW, looking at why sympy is importing test infrastructure when I import it, there’s this in sympy/utilities/__init__.py:

"""Some utilities that may help.
"""
from iterables import (flatten, group, take, subsets,
    variations, numbered_symbols, cartes, capture, dict_merge,
    postorder_traversal, preorder_traversal, interactive_traversal,
    prefixes, postfixes, sift, topological_sort)

from lambdify import lambdify
from source import source

from decorator import threaded, xthreaded

from runtests import test, doctest

from cythonutils import cythonized
from timeutils import timed

from misc import default_sort_key

This makes using utilities very convenient, since it contains everything at the top level. But the downside is it means you must always take everything. There is no way to import only part of utilities. Even if you use “from utilities.lambdify import lambdify,” Python will execute the utilities/__init__.py file, importing everything.

Lessons:

  • Modules really execute when imported,
  • Function defaults are computed once when the function is defined,
  • Modifying global state like the random module’s implicit sequence is bad,
  • Importing stuff into __init__.py makes your code monolithic and harder to use as you want, and
  • Monkey-patching can be a great way to debug problems.

Comments

[gravatar]
So many good lessons here for python developers. Also monkey patching random to allow you to easily see where the next random call was happening is a neat trick, I will definitely be using that. Great post.
[gravatar]
Thanks for pointing this out and for the nice booby-trapping example!

I've made the change to SymPy (development version).

https://github.com/sympy/sympy/pull/1785
[gravatar]
In general, I would avoid using random.random, as it is very easy to get issues like this. Basically, any module that uses it will mess up everyone else if they are trying to seed. Even worse, if one module seeds random, then it will make it no longer "random" for every other module that uses it.

Instead, you should use random.Random, which seeds without global state. We should also change SymPy to use this as well.
[gravatar]
Ah, I didn't noticed that you said the same thing in your blog post.

If you have any suggestions on how to improve SymPy's import structure, we'd love to hear it. We still want `from sympy import *` to import everything useful, because using SymPy interactively is very common. Will changing everything to use __all__ make things better?
[gravatar]
Also, I didn't know that edX was using SymPy. That's really cool.
[gravatar]
Incidentally, some caveats about seeding the Python RNG, if you want predictable sequences:

• Don't use string seeds, or the sequence will not be predictable (Python 3.2 uses a different hashing method, while the hash in Python < 3.2 returns different results on 32-bit and 64-bit machines).

• Don't use randrange(), randint(), choice(), or shuffle(), because the algorithms differ in different Python versions. Only random.random() is guaranteed to return the same numbers for the same seed.

Aside: I took a long time to verify what I wrote above. When I finally hit the Preview button, I got the edit box back with an error message above it saying "You took a long time entering this post. Please preview it and submit it again." Hey, I was trying to preview!)
[gravatar]
@Aaron + Chris: thanks for taking such quick action. It great to see open-source working so well.

@Marius: thanks for the heads up about repeatability, it makes total sense, though I'd never considered it before. (Sorry about the preview snafu.. spam prevention...)
[gravatar]
@Aaron, I would suggest looking at the way that Flask and Werkzeug have their import structure set up.

https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/__init__.py
[gravatar]
So basically it does delayed importing. Are there any restrictions on this? I wouldn't be surprised if this doesn't really do much for SymPy, because almost all the modules are tied to one another in non-trivial ways. But it's worth a shot.
[gravatar]
Your booby-trapping example is very neat!

btw, it's just about syntax cosmetics but what do you think about writing:
self._seed = random.random() if seed is None else seed
or
self._seed = seed or random.random()
[gravatar]
@guilload: I don't like the if-expression form, I find it too compact. The second form is fine with me, though you will get into a debate about whether zero is a valid explicit seed or not.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.