Wednesday, August 11, 2010

On the value of knowing the lower levels

While digging around some of the Python documentation (read: procrastinating), I stumbled across this section comparing Python performance to C in some very simple and contrived "benchmarks." The point of the section was that Python is not C, and you should be careful when trying to apply knowledge of C to Python. On a simple test of adding i to itself 10 million times, Python was about 50 times slower than C (which isn't really surprising, given that the C code just executes a series of single assembly addl instructions).

But this isn't about C vs Python (since those trade-offs are obvious, well-known, and boring in 2010), but about C vs... well, C. Taking a look at the code, I saw this:

I know what you're thinking, "Great, it adds 47 to itself to get 94, 500,000,000 times in a row. This is so fascinating, only a dissertation by Lady GaGa on the readability of her poker face could be more laden with philosophical insights and epiphanies of the greater meaning of the cosmos." By itself it isn't really much of anything.


However, can we make that C version even faster? One of the immediate problems you'll hit with any optimization from GCC is that it quickly zaps the loop right out of existence (since it can prove that removing it has no external impact on the application). I simplified the program a bit to remove all that extra verbosity:

Ah, much better! From my old C days, I remembered the good old register keyword could speed things up on some occasions. Like so many things C, the register keyword means little more than, "hey GCC, if you're not too busy or tired, I'd really like this variable to sit in a register, rather than memory. Oh, you don't feel like it right now? That's fine, registers are silly anyway, forget I ever brought it up. What's that? Why yes, of course I'll get off your lawn." But for old times' sake, I chucked it in there to get this radically-different piece of code:

For shits and giggles, I benchmarked the two head-to-head and wound up with this:

Yes, you read that right: just over 2 seconds for the non-registered version and around 0.3 seconds for the register-beggar. Running each a handful of times (not nearly enough to claim statistical significance, mind you), I got pretty stable numbers, so it's not just a fluke.

What is really going on here? To get a better idea, we have to dive down into the actual assembly output of GCC. Luckily, by slapping a little -S on the compiler options, GCC dumps the assembly version of each program. Here is the "memory-based" version:
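The assembly listing is missing from this copy of the post, but compiling a stack-based counting loop with `gcc -S` (no optimization, 32-bit x86) produces something close to this; exact prologue and label numbering may differ from the original output:

```
main:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$16, %esp
	movl	$0, -4(%ebp)        # counter lives at -4(%ebp), on the stack
	jmp	.L2
.L3:
	addl	$1, -4(%ebp)        # counter++: read memory, add, write back
.L2:
	cmpl	$499999999, -4(%ebp)
	jle	.L3
	movl	$0, %eax
	leave
	ret
```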

And here is the "registerized" version:
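This listing is missing as well; with the counter promoted to a register (say, the callee-saved `%ebx`), the output would look roughly like this reconstruction:

```
main:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx                # %ebx is callee-saved, so stash it first
	movl	$0, %ebx            # counter lives in %ebx
	jmp	.L2
.L3:
	addl	$1, %ebx            # counter++: pure register arithmetic
.L2:
	cmpl	$499999999, %ebx
	jle	.L3
	movl	$0, %eax
	popl	%ebx
	leave
	ret
```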

To get a clearer idea of the differences, here is a diff between the two assembly outputs:
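The diff didn't survive either, but assuming the counter lands in `%ebx`, the interesting hunk would look something like this (five instructions on the memory side, six on the register side, counting the extra save/restore of `%ebx`):

```
-	subl	$16, %esp
-	movl	$0, -4(%ebp)
+	pushl	%ebx
+	movl	$0, %ebx
 	jmp	.L2
 .L3:
-	addl	$1, -4(%ebp)
+	addl	$1, %ebx
 .L2:
-	cmpl	$499999999, -4(%ebp)
+	cmpl	$499999999, %ebx
 	jle	.L3
 	movl	$0, %eax
+	popl	%ebx
 	leave
```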

Changing those five (six in the register version) little instructions seems to cause a roughly seven-fold increase in speed, so what is going on?

First off, let me say that I am no 30-year veteran of x86 programming (or even a 3-year vet), so I'm almost certainly going to get some of the details wrong, but the general gist of things remains the same.

It turns out that really the only instructions that matter are those immediately after the labels .L2 and .L3. These (along with the jle .L3) are the only instructions being run within the loop (including the loop guard). The loop begins when the program jumps to label .L2. From there, it compares the loop counter (-4(%ebp)) with 499,999,999 (one less than the upper bound of the loop) and jumps back to .L3 if the loop counter is smaller. We then add 1 to the loop counter and continue on our way.

But wait, why does the memory version take so much longer? Notice that the memory version uses -4(%ebp) for both the addl and cmpl instructions, whereas the register version simply uses %ebx. It may not look like much, but what is going on there is a memory access (to a variable on the stack) in the memory version and a direct register access in the register version. We have a loop with two memory reads and a single write per iteration (addl has to read the memory location, add 1 to that value, and then write it back out, and cmpl reads it once more), and that alone cuts performance by a factor of about seven!

One of the big things they drove home in the multiprocessor synchronization course I took was that how nicely you play along with the memory hierarchy has huge implications for performance. This is doubly true when you're talking about multiprocessor contention over a lock or shared memory, where the various processors could be invalidating each other's caches if you're not careful, causing heavy cache-coherency traffic.

Long story short, these impossibly low-level details can add up to make a huge difference, especially if you're paying for them hundreds of millions of times!