Thursday, May 29, 2008

Missing the point

While making my daily rounds of programming-language blog posts, I came across a couple of items that caught my attention. I have been very interested in parallel and concurrent programming of late, especially in how to solve the big issue everyone seems worried about: code that is not easily parallelizable.

After following a couple of branches off of a Slashdot story, I found a couple of posts that seemed to confuse a few of the issues surrounding parallel programming. This one in particular confuses a number of different problems.

Starting off, he correctly points out that:

Users don't care about parallel processing anymore than they care about how RAM works or what a hash table is. They care about getting their work done.

Assuming that he is not talking about "people writing programs" as users, he is absolutely correct. As long as something works well and fast enough, nobody cares.

But therein lies the problem: well and fast enough. The "well" part is fairly simple: if you are doing multiprocessing, it still has to work. That's pretty obvious, and while it can be challenging at times, there is no real controversy over that fact.

This leaves the "fast enough" part. The problem here is that since the dawn of time (which according to my computer is January 1, 1970), people have been able to count on future computers getting faster. Moore's law and all. Nowadays, computers get faster by adding more cores, but software is still written on the assumption that it will get this hardware speedup for free. The problem is that the hardware guys are tired of giving the software guys gigantic speed boosts without forcing the software guys to change their behavior at all. They are holding a workers' revolution, throwing off the shackles of the oppressors and saying, "you want more speed, write your programs more parallely!"

He and the guy here do mention the problem of playing music while surfing the web and checking email (which are often both in the same program, but whatever). On a single-core system, playing a song, unzipping a file, and running a browser would lead to some whole-system slowdown, and this problem was greatly helped by multi-core computers. Nobody is arguing that. The problem is that this process-level parallelism only helps you until you have more cores than busy processes, which really isn't that far off. Think about how many programs you are running now. Aside from OS background services, which spend the vast majority of their time asleep, there are probably between 2 and 5.

Let's say that you are playing a movie, browsing a flashy website, talking to someone over VoIP, and compiling your latest project. That's four processes. You will see great improvements in responsiveness and speed all the way up to 4 cores, but after that, you will be sitting idle. If each of those programs is single-threaded, or its main CPU-intensive portion is single-threaded, then adding more cores won't help you at all.
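To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every busy process is single-threaded and CPU-bound, which is a simplification, and the function names are made up for illustration.

```python
def process_level_speedup(cores, busy_processes):
    """Best-case speedup over one core when every busy process is
    single-threaded: at most one core per process can do useful work."""
    return min(cores, busy_processes)


def idle_cores(cores, busy_processes):
    """Cores left with nothing to run under the same assumption."""
    return max(0, cores - busy_processes)


if __name__ == "__main__":
    # Movie, flashy website, VoIP call, compiler: four busy processes.
    for cores in (1, 2, 4, 8, 128):
        print(cores, process_level_speedup(cores, 4), idle_cores(cores, 4))
```

Past four cores the speedup column stops moving and the idle column just keeps growing.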

This is the point that the author of the second article misses. There may be 446 threads on his system, but, as he observes, many of them are for drawing the GUI or doing some form of I/O. Drawing a GUI takes relatively little processor power compared with something like encoding a video (unless you are running Vista, har har), and for I/O threads, most of what they are doing looks like this:

while (!is_burnt_out(sun)) {
    data = wait_until_ive_gotten_something();  /* blocks; no CPU used while waiting */
    copy(data, buffer);
}

In other words, not really doing anything with the CPU. This means that, although there are a great number of threads, only a few of them will actually be using the processor (Firefox, for example, does all of its rendering and JavaScript execution in a single thread for all open windows and tabs). The "crisis" that everyone is talking about is when those single computation threads start to get overloaded. With so many things moving to the browser, what good is a 128-core CPU if my precious web apps all run in Firefox's single thread? I have these 768 cores sitting around, wouldn't it be nice to use more than 2 of them for video encoding?

Just Make The Compilers Do It

One thing that the second article brings up is automatically parallelizing compilers. I do think that there is something to be said for these, especially since it has been shown over and over, first with assembly and then with garbage collection, that compilers and runtimes will get smarter than programmers at doing things like emitting assembly or managing memory. I would not rule out the chance that a similar thing will happen with parallelization.

I do think that making parallelizing compilers will not be as "easy" as writing good optimizing compilers or good garbage collectors. To generate highly concurrent code, the compiler would need a broad enough view of a serial program to "figure out" how it can be broken up and which parts can run concurrently. This would go beyond inlining functions or unrolling loops to figuring out when to spawn new threads and how and when to add locks to serial code to extract maximum performance. Far be it from me to say that we will "never" come up with something that smart, but I seriously doubt that we will be able to code exactly as before and have our compiler do all the magic for us.
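A small sketch of the distinction such a compiler would have to discover on its own. The function names are hypothetical. In the first loop every iteration is independent, so it can be handed to a pool of workers. In the second, each iteration needs the result of the previous one, so splitting it up naively would give wrong answers. In CPython the thread pool will not actually speed up CPU-bound work because of the global interpreter lock, so this shows the structure of the problem, not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def squares_serial(values):
    out = []
    for v in values:
        out.append(square(v))  # no iteration depends on any other
    return out


def squares_parallel(values, workers=4):
    # Safe to split up: pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


def running_total(values):
    out, total = [], 0
    for v in values:
        total += v  # loop-carried dependency: needs the previous total
        out.append(total)
    return out


if __name__ == "__main__":
    print(squares_parallel(range(10)) == squares_serial(range(10)))
    print(running_total([1, 2, 3, 4]))
```

A human can see at a glance that the first loop is a map and the second is a scan. Getting a compiler to prove the same thing about arbitrary serial code, with pointers and side effects everywhere, is the hard part.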

Save Us, Oh Great One!

So what do we do? The answer is not yet clear. There are clearly a lot of problems with traditional multithreading with locks (a la pthreads and Java Threads), but nobody seems to agree on a clearly better way of doing things. I saw a cool video here about a lock-free hash table. The idea of using a finite state machine (FSM) for designing a concurrent program is fascinating, but I could see problems with data structures involving more dependent elements, like a balanced tree. Still, I think the approach gives good insight into one way to progress.

A similar, but less... existing idea is that of COSA, a system by renowned and famous crackpot Louis Savain. He uses an interesting, but basically unimplemented graphical model for doing concurrent programming and talks about the current problems with concurrent code.

Now, judging from the tone of this article and the titles of some of his other posts (Encouraging Mediocrity at the Multicore Association, Half a Century of Crappy Computing), he seems to be 90% troll, 10% genius inventor. I took a brief look at his description of COSA, and it seems to have some similar properties to the (much more legitimate) lock-free hash table. The idea of modelling complex program behavior as a series of state machines or as a big graph is appealing, but a lot would need to be done to make this into a practical system.

As interesting as COSA looks, the author seems to be too much of a Cassandra for it to gain any appeal. I mean "Cassandra" in the "can see the future, but nobody believes him" sense, not the "beauty caused Apollo to grant her the gift of prophecy" sense (but who knows, I've never seen the guy).

I see some evolution of parts of each of these ideas as being a potential solution for some of the multiprocessing problems. The guy who made the lock-free hash table was working on the level of individual CAS (compare-and-swap) operations, which would definitely need to change for a widely used language. Just as "goto" was too low-level and was replaced with "if", "for", and "map", some useful abstraction over CAS will come along and enable concurrent programming at a higher level of abstraction.
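To make the "abstraction over CAS" idea a little more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: Python has no hardware CAS primitive, so AtomicCell fakes one with a lock, and atomic_update is a hypothetical helper, not anything from the hash table talk. The point is only the shape: callers pass a pure function and never see the retry loop.

```python
import threading


class AtomicCell:
    """Emulated compare-and-swap cell. A real CAS is a single hardware
    instruction; the lock here only stands in for that atomicity."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Store new only if the cell still holds expected.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_update(cell, fn):
    """Higher-level operation: apply fn, retrying until the CAS wins."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, fn(old)):
            return


def run_counter(num_threads=4, increments=1000):
    cell = AtomicCell(0)

    def worker():
        for _ in range(increments):
            atomic_update(cell, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()


if __name__ == "__main__":
    print(run_counter())
```

The worker code contains no locks and no explicit CAS calls, which is roughly the level I would want to be programming at.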