Wednesday, February 15, 2012

Set ordering in Python: why I didn't sleep last night

In yesterday's class of the Python-infused Linear Algebra course I'm taking, a fellow student noted the strange ordering of items in a set when printed. It seemed that the printed order of the set items always swapped the first two items, but left the rest unchanged. I don't know what sets he tried, but I was interested in the general behavior, and so began my quest through the Python source code...


BIG IMPORTANT WARNING!!!

These are some interesting properties of the iteration order of Python's set objects. They are NOT to be relied upon for the correctness of your program! There are a bunch of hardcoded, opaque magic numbers which determine the order, so it could easily change across micro versions, platforms, or compilations with different options. Treat sets as having undefined ordering!

All of this is looking at the tag "v3.1.1" in the cpython repository on the Python repositories site. "3.1.1" is the version of python3 on the computers here, so I figured that would be the one to check, though I assume the set code hasn't changed that much. The root of that version is here.

Sets are defined in the source files "Include/setobject.h" and "Objects/setobject.c" for the headers and implementation, respectively. The relevant part of sets is that their elements are stored in a hash table using open addressing for collision resolution. Python (version 3.1.1, anyway) uses the following algorithm for resolving collisions, in Python (with cheats):
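A sketch of that probe loop in Python (in the C code the loop walks entry pointers; here I yield slot indices instead — the recurrence and shift constant are from `Objects/setobject.c` as I remember them, so treat this as approximate):

```python
PERTURB_SHIFT = 5  # constant from Objects/setobject.c


def probe_sequence(h, mask):
    """Yield table indices in the order the set would probe them for
    an object whose hash is h; mask is table_size - 1."""
    perturb = h
    index = h
    while True:
        yield index & mask
        # CPython: j = (j << 2) + j + perturb + 1, i.e. j = 5*j + perturb + 1.
        # Mixing in the high bits of the hash via perturb breaks up
        # clusters of colliding keys.
        index = 5 * index + perturb + 1
        perturb >>= PERTURB_SHIFT
```

For an 8-slot table (`mask == 7`), an object hashing to 0 probes slot 0 first, then 1, then 6, and so on; an object hashing to 8 starts at the same slot 0, which is the collision exploited later in this post.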

entry is actually a pointer (the whole function is in C); I'm fudging the specifics here a lot. mask is table.array_size() - 1 where table.array_size() is the size of the underlying C array. The essence is the updating of index; it is intended to distribute keys all over the array, making collisions less likely. There are a bunch of trade-offs, and they are described in "dictnotes.txt" and "dictobject.c". One interesting fact is that basic integers hash to themselves (and are ANDed with mask for indexing into the array).

The use of open addressing means Python can retrieve all elements simply by walking the table. Entries in the set are defined as:
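As I recall it, the entry is a small C struct pairing the key pointer with its cached hash; here is a rough Python model (the C in the comment is from memory, so treat it as approximate):

```python
from collections import namedtuple

# Rough model of the C struct:
#
#     typedef struct {
#         long hash;      /* cached hash of key */
#         PyObject *key;  /* NULL when the slot is unused */
#     } setentry;
#
SetEntry = namedtuple("SetEntry", ["hash", "key"])

# An unused slot has key == NULL, modelled here as None:
EMPTY_SLOT = SetEntry(hash=0, key=None)
```

Caching the hash in the entry means a resize can redistribute every element without calling each object's hash function again.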

The set object itself comprises an array of setentries and some additional fields, like the number of in-use entries and the size of the array.

Calling list(o) converts a "sequence object" (there is an abstracted interface at the C level) by calling PySequence_List, declared in "Include/abstract.h" and implemented in "Objects/abstract.c". PySequence_List does some basic checks and calls _PyList_Extend, in "Objects/listobject.c", which creates a new list and calls listextend(self, iter(o)) for the real work. It then simply walks through the iterator, appending the items returned to itself, doubling its length if it runs out of space. So the ordering is determined by the iterator for sets.

The work is done in the setiter_iternext() function, which walks through the backing array until it finds a non-NULL entry and returns that. So the order of objects from the result of list(s) is determined by the order of elements in the table. That is determined by PySet_Add which calls set_add_key which uses PyObject_Hash to get the hash of the object to add.
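A quick way to convince yourself of this (on any CPython, whatever order it happens to pick):

```python
s = {"spam", "eggs", "ham"}

# list() just drains the set's iterator, so it must agree with a
# plain for-loop, which walks the same backing table.
assert list(s) == [x for x in s]
```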

PyObject_Hash looks into the object's vtable (Python doesn't call it that, it's stored in the object's ob_type field) to find its tp_hash function and calls that. tp_hash returns a long. Each object has its own hash function; this is what is used for a couple of common types:

  • double: uses the integer part of the float (with some extra stuff)
  • pointer: rotates the pointer right by 4 bits (since the lower 3-4 bits are likely to be zero)
  • int: the absolute value of the int modulo the platform's maximum long value. This actually can get interesting, since Python longs are arbitrary precision. The hash is computed through some additions, rotations, and fun stuff like that.
  • tuple: the hashes of its objects, XORed together and multiplied by a sequence of constants to mix up the value, and something having to do with primes.
  • string: a multiply-and-XOR loop over the bytes (not "characters"! Know your Unicode!).
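A few of these are easy to check at the REPL (these are CPython behaviors, not language guarantees):

```python
# Small ints hash to themselves:
assert hash(1) == 1
assert hash(12345) == 12345

# CPython quirk: -1 is reserved as the C-level error return from
# tp_hash, so hash(-1) is bumped to -2.
assert hash(-1) == -2

# Numerically equal values hash equal across types by design, so
# 1 and 1.0 would land in the same table slot.
assert hash(1.0) == hash(1)
```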

So the order of items from list(s) is "mostly" stable for sets containing the same objects. The exception is when objects with the same hash are inserted in different orders; because of open addressing, the order of insertion determines their place in the array. The following case should hold true:
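I don't know exactly what the original snippet showed, but a case along these lines should hold: small ints with distinct hashes land in slots independent of insertion order.

```python
a = set([1, 2, 3])
b = set([3, 2, 1])

# No two elements collide (1, 2, and 3 each occupy their own slot),
# so the table layout -- and therefore the iteration order -- is the
# same regardless of insertion order.
assert list(a) == list(b)
```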

So it is possible for sets with the same contents to have different orderings, but it requires a collision of hash(a) & mask. Remember, (for ints) hash(i) == i as long as i fits in a machine long. You can create a collision easily by exploiting knowledge of the initial size of the internal array. Rather cleverly, the set object includes an inline array of PySet_MINSIZE entries (defaulting to 8) within the structure itself. This means that if your set doesn't grow beyond 5 elements (the hard-coded load factor is 2/3), there is no extra memory allocated for the table. Given those two facts, all you need is (a, b) such that a & 7 == b & 7. How about (0, 8)?

You can then do:
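Something along these lines (the exact output is CPython-specific, so don't rely on it):

```python
s1 = {0, 8}   # 0 inserted first: it wins slot 0, and 8 is displaced
s2 = {8, 0}   # 8 inserted first: it wins slot 0, and 0 is displaced

# Same contents, same hashes -- but insertion order decided who got
# slot 0, so the two sets iterate in different orders.
print(list(s1))
print(list(s2))
```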

I'm not sure why the order of the elements is reversed; this only happens for certain combinations:
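I no longer have the exact combinations the original snippet listed, but you can survey the colliding pairs (i, i + 8) yourself; which pairs come out "reversed" depends on where the probe sequence drops the displaced element, and is implementation-specific:

```python
# Every pair (i, i + 8) collides in an 8-slot table, since
# i & 7 == (i + 8) & 7. Print the iteration order for both
# insertion orders and see which pairs come out swapped.
for i in range(8):
    forward = list({i, i + 8})
    backward = list({i + 8, i})
    print(i, forward, backward)
```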

I started investigating deeper into how Python fills its hash tables, but decided to stop after a dozen hours of reading through Python's parser code and figuring out pixel offsets to pass to PyCairo for the visualization. My progress thus far is here. I'm trying to get a visualization of exactly how the entries are added, but porting Brandon Rhodes' dictionary code to work for sets is non-trivial. Also, getting the visualization to display in a Gtk window rather than writing to a file took some fiddling with PyGtk and PyCairo.

tl;dr: the order of objects in a set when converted to a list (such as when displayed at the REPL) depends on the hash of each object and on the collision resolution strategy. The pattern the student found, in which the first two elements of the set appeared swapped, was a coincidence: the indexing function attempts to distribute keys uniformly over the table.

I spent WAY too much time on this. Interesting, though.