Tuesday, October 7, 2008

On Linux on the Desktop

This is a message I sent in response to an email from my dad. He was commenting on the recent story about the high return rates for Linux Netbooks.

I was reading more about this company that was selling the cheap Linux notebook, and getting lots of returns.


I think this was in reference to Netbooks (not "notebooks," these are smaller and lower powered than a traditional laptop) being sold by MSI. MSI is primarily a hardware company that targets OEMs and people who would assemble their own computers (like me) and so doesn't have much experience making usable operating systems.

Apparently, they (foolishly) decided to use a custom Linux distribution rather than something like Ubuntu. It seems like the problem is more one of MSI giving people a bad installation than an inherent problem of Linux.

This particular article was pointing out that people who bought this in the first place were more adventurous and more knowledgeable than most computer users to begin with, but still could not deal with it.


Again, I think that this was largely a problem of a poor install rather than a fundamental problem of Linux.

I would amend to say "Linux is a great choice for extremely technically sophisticated users who prefer being as far as possible from the mainstream."


That has certainly been the traditional user-base, and is still a significant part of the development community (Richard Stallman refuses to use the *Web*), but there is effort going into changing that. With the increasing popularity of Linux on these Netbooks (would this story even have been possible a few years ago?) as well as cell phones, there is a lot of effort going into usability improvements.

The great thing about open source programs is that it is very hard for useful programs to "just die." If a commercial program loses its corporate overlord, it can fade out and wither away. If a company gets bought up or out-competed, applications can disappear. This has been the Microsoft strategy. The reason they are so scared of Linux and open source is that even if you kill every developer of open source programs, the code is still there, and anyone with the knowledge and inclination can work on it.

Since most open source programs don't have the burden of needing to make money off of their direct sale, they tend not to get worse for the sake of adding features. Look at any version of Norton after around 2003. They needed people to keep buying the program, so they needed to add *something* to make it different from the previous version. The problem is that it basically already did most of what it needed to do, so they had to add un-features, making it worse than it was, to the point where a computer was better off without Norton than with it.

For open source programs, if they reach maturity, people will maintain them, but not add features for the sake of making more money, since the developers generally don't make money directly off of the sale of the program. This means that programs which are basically done don't try to add useless features.

Also, it seems like investments in technologies and frameworks provide more of a network-effect benefit within the open source world than in the proprietary world. Open source has been playing catch-up for a while, but it is starting to pull ahead, with the web browser space being the most dramatic example. While it took a while for Firefox to reach parity with Internet Explorer, the current version of Firefox is much faster and more featureful than the latest IE, and the development versions of both Firefox and WebKit (the engine that powers Safari) have JavaScript execution engines up to 40x faster than IE's.

The level of polish and development that constitutes "acceptable" is not static, but it is not moving as fast as the development of the open source ecosystem. Ubuntu is usable for many people's day-to-day tasks already, and is only getting more usable. As time goes on, it will become acceptably easy for an increasing number of people.

Linux can take over from Windows, but they need to make it easy. And for that they have a ways to go.


Agreed, but there has been huge progress within the last few years. And it shows no signs of slowing.

It is sort of like in my field we generate many different kinds of images. The neurologists complain that the labelling of the images is inconsistent so "they can't tell what they are looking at". We never look at the labels, because it is obvious from a glance what kind of image it is. So to us an elaborate system to produce consistent labels would be so useless as to be a waste of whatever time it took to implement. To the neurologists not having it is a problem. If I were as into Linux as I am into brain images, then I suspect I would find a GUI as useless as you do. As it is, I have fewer demands on my computer, but "labor saving" is at the top of the list.


I think this is what has kept the Linux GUI less newbie-friendly than the Mac's, or than certain aspects of Windows (I would assert that Ubuntu is more user-friendly in many respects than Windows, but not as familiar to many people). It is easier for a commercial company to hire usability experts and compel interface designers to produce good GUIs than for a group of hobbyists who don't mind a CLI to spontaneously produce a good GUI.

The instructions may not always be clear, but they do not run to "type the following with exactly this syntax, except, of course, changing the part you need to change."


I will have to say that it is a lot easier to tell someone:

Type this:

ps aux | grep zfs

and copy and paste the results to me


than it is to say,

Open task manager, try to find each entry with the string 'zfs' in it, and tell me what appears in each column.


On the other hand, it is a lot easier to click the "Applications" menu and then look through the "Accessories" or "Games" menu than to memorize that your chess game is launched with "glchess" or that "Manage Passwords and Encryption Keys" is launched with "seahorse." It all depends on what you're trying to do and how much up-front time you are willing to invest in order to save time later.

Saturday, June 7, 2008

Really Really Simple Syndication

From the "Everything is Miscellaneous" and "The Web That Wasn't" talks, it seems like what we really need is some way for computers to figure out what information we want and give it to us. There is a huge amount of info encoded in blogs (the example was nearly one blog post for each word in a Bush speech about immigration), but so far, there is no good way of extracting much of it. Sure, we can google for "blogs about Bush's immigration speech," but even that would likely turn up a bunch of blogs about random things, a bunch of pages about unrelated immigration topics, and so on.

The first step would be to be able to do something like "tell me all of the instances in which Bush has changed his position on immigration," which would look at all the blogs that referenced the speech and find things relating to a change in the stated position. This would probably require natural language understanding of queries, but that might not even be necessary, as it could simply relate the words in the query with the words in the blogs.
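
To make the "relate the words in the query with the words in the blogs" idea concrete, here is a minimal sketch; the posts and the scoring rule are invented for illustration, and a real system would at least want stemming and term weighting:

def tokenize(text):
    return set(text.lower().split())

def score(query, post):
    # fraction of the query's words that also appear in the post
    q, p = tokenize(query), tokenize(post)
    return len(q & p) / len(q) if q else 0.0

posts = {
    "post-1": "Bush softens stance on immigration in latest speech",
    "post-2": "Recipe: my grandmother's apple pie",
}

query = "instances in which bush has changed his position on immigration"
ranked = sorted(posts, key=lambda name: score(query, posts[name]), reverse=True)
print(ranked)  # post-1 ranks first; post-2 shares almost no words with the query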

The next thing needed would be for it to be anonymous (or at least have that possibility) and distributed. Aside from privacy concerns, there could be massive scalability problems, a la Twitter. There is the concept of a distributed hash table, but that only works for exact matches. Wikipedia to the rescue, with this (PDF) bit of magic from Berkeley.
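
To see why a plain distributed hash table only handles exact matches, here is a toy sketch; the node names are made up, and a real DHT like Chord or Kademlia is far more involved, but the key-to-node mapping idea is the same:

import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for(key):
    # hash the exact key and use it to pick the node responsible for it
    digest = hashlib.sha1(key.encode()).digest()
    return NODES[digest[0] % len(NODES)]

print(node_for("bush immigration speech"))
print(node_for("bush immigration speeches"))
# The two keys differ by one character, but their hashes (and therefore
# the nodes responsible for them) are completely unrelated, so "nearby"
# keys give you no help in finding similar content.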

Personalized feeds



However, this would only be the beginning. Once you can respond to general natural-language queries like that, you could build up a list of what someone was interested in. RSS feeds are a fascinating concept and allow people to get a feed of news customized for their tastes, but they assume that everything at a given address will be interesting to me.

If you go to the RSS page on the New York Times, you will see that they provide feeds for "Business," "Arts," "Automobiles," and so on. Going down further, there are feeds for just "Media & Advertising" and "World Business" under Business, and "Design" and "Music" under Arts, but nothing under Automobiles. This is the kind of problem that David Weinberger was talking about: what if all "Arts" are pretty much the same for me, but I want to differentiate between "Foreign Cars" and "Domestic Cars," or even "Sports Cars" and "Trucks"? Maybe I don't care about red cars at all, so I don't ever want to see a story about red cars in my feed.

The problem with RSS is that, even though it is a huge step forward for keeping track of many different sources of news at once, someone else has to decide how to split up the feeds. Most places give you an option of what you might want; the current system at the New York Times is much better than having one giant "this is everything" feed, but it still has a ways to go. Finer-grained feeds could be offered without having to rewrite our RSS clients by letting each user of a site set up a custom RSS URL from which to pull updates, but that would place a huge burden on the content providers and would not scale at all.
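
For a rough idea of what such a per-user feed amounts to, here is a sketch that pulls one of the broad feeds and filters it against a personal keyword list; whether this runs on the provider's servers or in the reader's client is exactly the scaling question above. The feed URL and keywords are placeholders, and it assumes the feedparser library:

import feedparser

FEED_URL = "http://example.com/automobiles.rss"   # placeholder feed
WANTED = {"sports car", "truck"}
UNWANTED = {"red car"}

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
    if any(w in text for w in WANTED) and not any(w in text for w in UNWANTED):
        print(entry.get("title"))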

The solution



So after we have a good way for the user to tell the system what to get, we would want a way for the system to learn what the user likes, possibly using a Netflix-like recommendation system, and automatically pull down stories that the user likes.
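
As a toy version of that learning step, one could build a word profile from stories the user has liked and rank new posts against it. The stories here are invented, and a real Netflix-style system would use collaborative filtering across many users rather than simple word counts:

from collections import Counter

liked = [
    "net neutrality rules for internet providers",
    "new policy speech about internet regulation",
]
# word profile built from everything the user has liked so far
profile = Counter(word for story in liked for word in story.lower().split())

def interest(post):
    return sum(profile[word] for word in set(post.lower().split()))

candidates = [
    "president gives speech on internet policy",
    "local team wins baseball game",
]
for post in sorted(candidates, key=interest, reverse=True):
    print(interest(post), post)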

If President Bush gives a new speech about policy for the Internets and 1,000 people blog about it, our little system should automatically sift through all of the posts, figure out which parts of the blogs would likely interest you, based on what you have told it you like in the past, and present you with some sort of reasonable compilation of the information, all without any user interaction. This would really unlock the potential of the Internet. Get hacking.

Thursday, May 29, 2008

Missing the point

While making my daily rounds of programming language-related blog posts, I came across a couple of items that caught my attention. I have been very interested in parallel and concurrent programming of late, especially how to solve the big issues everyone seems worried about with code that is not easily parallelizable.

I saw a couple of posts, after following a couple of branches off of a Slashdot story, which seemed to confuse a few of the issues surrounding parallel programming. This one in particular confuses a number of different problems.

Starting off, he correctly points out that:

Users don't care about parallel processing anymore than they care about how RAM works or what a hash table is. They care about getting their work done.

Assuming that he is not talking about "people writing programs" as users, he is absolutely correct. As long as something works well and fast enough, nobody cares.

But therein lies the problem: well and fast enough. The "well" part is fairly simple: if you are doing multiprocessing, it still has to work. That's pretty obvious, and while it can be challenging at times, there is no real controversy over that fact.

This leaves the "fast enough" part. The problem here is that since the dawn of time (which according to my computer is January 1, 1970), people have been able to count on future computers getting faster. Moore's law and all. Nowadays, computers get faster by adding more cores, but software is still written assuming that we will get this hardware speedup. The problem is that the hardware guys are tired of giving the software guys gigantic speed boosts without forcing the software guys to change their behavior at all. They are holding a worker's revolution and throwing off the shackles of the oppressors and saying, "you want more speed, write your programs more parallely!"

He and the guy here do mention the problem of playing music while surfing the web and checking email (which are often both in the same program, but whatever). On a single-core system, playing a song, unzipping a file, and running a browser would lead to some whole-system slowdown, and this problem was greatly helped by multi-core computers; nobody is arguing that. The problem is that this process-level parallelism only helps you until you have more cores than processes, which really isn't that far off. Think about how many programs you are running now. Aside from OS background services, which spend the vast majority of their time asleep, there are probably between 2 and 5.

Let's say that you are playing a movie, browsing a flashy website, talking to someone over VoIP, and compiling your latest project. That's four processes. You will see great improvements in responsiveness and speed all the way up to 4 cores, but after that, you will be sitting idle. If each of those programs is single-threaded, or its main CPU-intensive portion is single-threaded, then adding more cores won't help you at all.

This is the point that the author of the second article misses. There may be 446 threads on his system, but, as he observes, many of them are for drawing the GUI or doing some form of I/O. Drawing a GUI takes relatively little processor power compared with something like encoding a video (unless you are running Vista, har har), and for I/O threads, most of what they are doing looks like this:


while (! is_burnt_out(sun)) {
    /* sleep until the OS hands us some data; no CPU used while waiting */
    data = wait_until_ive_gotten_something();
    /* shuffle it into a buffer and go right back to sleep */
    copy(data, buffer);
}


In other words, not really doing anything with the CPU. This means that, although there are a great number of threads, only a few of them (Firefox does all of its rendering and JavaScript execution in a single thread for all open windows and tabs) actually will be using the processor. The "crisis" that everyone is talking about is when those single computation threads start to get overloaded. With so many things moving to the browser, what good is a 128-core CPU if all of my precious web apps all run in Firefox's single thread? I have these 768 cores sitting around, wouldn't it be nice to use more than 2 of them for video encoding?

Just Make The Compilers Do It



One thing that the second article brings up is automatically-parallelizing compilers. I do think that there is something to be said for these, especially since it has been shown over and over, first with assembly and then with garbage collection, that compilers and runtimes will get smarter than programmers at doing things like emitting assembly or managing memory. I would not rule out the chance that a similar thing will happen with parallelization.

I do think that making parallelizing compilers will not be as "easy" as writing good optimizing compilers or good garbage collectors. The problem is that the compiler would have to have a broad enough view of a serial program to "figure out" how it can be broken up and what parts can run concurrently before it could generate very concurrent code. This would go beyond inlining functions or unrolling loops to figuring out when to spawn new threads and how and when to add locks to serial code to extract maximum performance. Far be it from me to say that we will "never" come up with something that smart, but I seriously doubt that we will be able to code exactly as before and have our compiler do all the magic for us.
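
For a flavor of the kind of transformation such a compiler would have to discover on its own, here is the hand-written version of the easy case: a loop whose iterations are completely independent, split across cores with a process pool. The work function is just a stand-in for some CPU-heavy computation:

from concurrent.futures import ProcessPoolExecutor

def work(n):
    # stand-in for an expensive, independent computation
    return sum(i * i for i in range(n))

def serial(items):
    return [work(n) for n in items]

def parallel(items):
    # the "obvious" parallelization a human sees instantly because the
    # iterations share no data; a compiler would have to prove that
    with ProcessPoolExecutor() as pool:
        return list(pool.map(work, items))

if __name__ == "__main__":
    items = [200_000] * 8
    assert serial(items) == parallel(items)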

Save Us, Oh Great One!



So what do we do? The answer is not yet clear. There are clearly a lot of problems with traditional multithreading with locks (a la pthreads and Java Threads), but nobody seems to agree on a better way of doing things. I saw a cool video here about a lock-free hash table. The idea of using a finite state machine (FSM) for designing a concurrent program is fascinating, but I could see problems with data structures involving more dependent elements, like a balanced tree. Still, I think the approach gives good insight into one way to progress.

A similar, but less... existing idea is that of COSA, a system by renowned and famous crackpot Louis Savain. He uses an interesting, but basically unimplemented graphical model for doing concurrent programming and talks about the current problems with concurrent code.

Now, judging from the tone of this article and the titles of some of his other posts (Encouraging Mediocrity at the Multicore Association, Half a Century of Crappy Computing), he seems to be 90% troll, 10% genius inventor. I took a brief look at his description of COSA, and it seems to have some properties similar to the (much more legitimate) lock-free hash table. It models complex program behavior as a series of state machines, or as one big graph, which is appealing, but a lot would need to be done to make this into a practical system.

As interesting as COSA looks, the author seems to be too much of a Cassandra for it to gain any appeal. I mean "Cassandra" in the "can see the future, but nobody believes him," sense, not the "beauty caused Apollo to grant her the gift of prophecy" part (but who knows, I've never seen the guy).

I see some evolution of parts of each of these ideas as being a potential solution for some of the multiprocessing problems. The guy who made the lock-free hash table was working at the level of individual CAS (compare-and-swap) operations, which would definitely need to change for a widely used language. Just as "goto" was too low-level and was replaced with "if", "for", and "map", some useful abstraction over CAS will come along and enable concurrent programming at a higher level of abstraction.
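
To give a rough picture of what working at the level of individual CAS operations looks like, here is the classic compare-and-swap retry loop. On real hardware CAS is a single atomic instruction; the lock inside this toy cell only simulates that atomicity so the sketch runs as plain Python:

import threading

class Cell:
    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # set the value to new only if nobody changed it since we read it
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(cell):
    while True:                          # retry until our CAS wins the race
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return

counter = Cell(0)
threads = [threading.Thread(target=lambda: [increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())                    # 4000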

Monday, April 21, 2008

On Voting Systems and Hash Trees

So after reading this article, I started thinking a bit about the implementation side of some of our voting problems.

First, let's think about requirements. A good voting system should be:

  1. Private, in that nobody can force you to reveal who you voted for. This is important because it prevents someone from threatening a voter for voting for a certain candidate. Also, if someone publicly states that they support one candidate but choose to vote for another, it could be damaging to their reputation to have it exposed that they voted differently. Who someone votes for is their private business.

  2. Auditable. There must be some way to prove that the votes cast are the same ones counted for deciding who is president (or Governor, or senator, etc.).


  3. Verifiable for individuals. If I suspect that my vote was altered, I should be able to check that the vote that I made has not been modified before being counted.

The traditional way of using paper ballots works to a point, but paper can be modified, fake ballots can be forged, and then, of course, there was the infamous "dangling chad" fiasco.

Thus, I have spent some time brainstorming a cryptographically secure system using public keys and hash trees.

There would be a master key pair for the whole election, owned by the federal body in charge of handling elections. Each state would have its own key pair, as would each polling station and each machine. All of the state public keys would be signed by the federal key, each state would sign the keys of all of the polling stations, and the polling stations would sign the keys of each machine.

When a person then votes, a hash of their vote (optionally with some random value) is signed by the voting machine's key and by the polling station's key. Before the user votes at a station, they can verify that the state voting authority actually signed the key of the polling station, and then verify that the polling station signed the key of the voting machine. Assuming the federal key could be accurately transmitted to everyone, which could be done over a number of very public channels, the user could verify all of the keys down to that of the individual machine at which they are voting.
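
Here is a rough sketch of that signing chain using Ed25519 keys from the Python cryptography package. The hierarchy is collapsed to three levels (federal, state, polling station) and the vote text and nonce are made up; it is only meant to show that each level signs the public key of the level below it, and that a voter holding the federal public key can check the whole chain:

import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives import serialization

def raw(public_key):
    return public_key.public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw)

federal = Ed25519PrivateKey.generate()
state = Ed25519PrivateKey.generate()
station = Ed25519PrivateKey.generate()

# Each level signs the public key of the level below it.
state_cert = federal.sign(raw(state.public_key()))
station_cert = state.sign(raw(station.public_key()))

# The polling place signs the hash of the (padded) ballot.
vote_hash = hashlib.sha256(b"I vote for Obama" + b"some-random-nonce").digest()
vote_sig = station.sign(vote_hash)

# A voter who trusts only the federal public key can verify the chain;
# verify() raises InvalidSignature if anything was tampered with.
federal.public_key().verify(state_cert, raw(state.public_key()))
state.public_key().verify(station_cert, raw(station.public_key()))
station.public_key().verify(vote_sig, vote_hash)
print("signature chain checks out")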

The purpose of this is so that if the user suspects that their vote was not counted or was manipulated, they can go to the polling station or state court and prove that a polling station authorized by the state registered their vote. This would prevent people from being lured to fake polling stations which simply threw away their vote.

Also, to ensure that the state received the vote, as part of the process the user's vote could be sent (securely) to a computer owned by the state voting group and a signature sent back to the user. This way, the user would have proof that the state received the vote that the user intended.

The next problem is allowing users to force the government to prove that the vote used for counting is the same as the vote that the user made. This could be done with a hash tree. Once each state had collected all of the votes, it would assemble a hash tree using the earlier hashes of votes as the leaves. The state would then sign and publish the root hash. If a user wanted to verify that the government actually used his or her vote in the final count, the user would ask for a validation of their hash, which would require the government to transmit only a logarithmic number of hashes to the user. If the root recomputed from these hashes matched the public root hash, then the user would know that his or her vote was accounted for.
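
A minimal sketch of the hash-tree part, using hashlib and four placeholder votes (and assuming a power-of-two number of leaves to keep the code short): the state publishes only the root, and a voter checks an inclusion proof that contains one sibling hash per level of the tree.

import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    # level 0 is the vote hashes; each higher level hashes pairs together
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def proof(levels, index):
    # the sibling hash at each level is all the voter needs to be sent
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path

def verify(leaf, path, root):
    node = leaf
    for sibling, sibling_is_left in path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

leaves = [h(v.encode()) for v in ["vote-1", "vote-2", "vote-3", "vote-4"]]
levels = build_levels(leaves)
root = levels[-1][0]            # this root is what the state signs and publishes
assert verify(leaves[2], proof(levels, 2), root)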

When the user goes to a voting station, they would take with them a copy of the keys used by both the federal government and the state. The user would leave with a secret random number which, together with their vote, could be used to reconstruct the user's vote hash, as well as with signatures from the voting machine, polling station, and the state. At home, the user could manually verify that their vote hash was correctly calculated, as well as verify the signatures of each of the authorities who signed the user's vote hash.

The remaining issue is that of the random number. Since there are so few possible choices in an election, without any padding most hashes would be the same. If the options were "I vote for McCain" and "I vote for Obama", then using the SHA1 algorithm, there would be only two different hashes:

"I vote for McCain"
2647414cd4c7769b05fcb68c9b6dde321fdaa49c

"I vote for Obama"
65d5d0d9855878e256a75c8c7fce688b4877c193


Someone could then simply look and see which hash was given and determine who the user voted for. The solution typically used when hashing messages with so few possible values is to add some random data (that is stored for validation) to the message. A simple example is to include the current time:

"I vote for Obama at Mon Apr 21 17:52:07 EDT 2008"
e194810aa658f103e5359a985fd35e722eb3e857

"I vote for Obama at Mon Apr 21 17:52:49 EDT 2008"
bcf9b0934b562621b8a861ce8eb427c2fa99da16


This could be supplemented with additional data to reduce the likelihood that two people voting for the same candidate would wind up with the same hash, since it is easy to imagine that two people in a country of 300 million could vote for the same candidate in the same second.
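
A quick sketch of that padding, using a random nonce from Python's secrets module (the vote text is the same placeholder as above): the voter keeps the nonce so the hash can be re-checked later, and two identical votes no longer produce the same digest.

import hashlib, secrets

def vote_hash(vote, nonce):
    return hashlib.sha1((vote + nonce).encode()).hexdigest()

nonce_a, nonce_b = secrets.token_hex(16), secrets.token_hex(16)
print(vote_hash("I vote for Obama", nonce_a))
print(vote_hash("I vote for Obama", nonce_b))   # same vote, different hash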

There is still the problem of hiding the vote, though. The user needs that data at the end of the day to be able to verify that the hash correctly identifies the user's vote, but if an adversary had the random data and the hash, they could figure out the user's vote as easily as before. I don't have a good solution for this yet; one possible option is giving the user a hash of the random data and mailing the actual value to the user separately, but that is cumbersome.

Anyway, I doubt that this system will actually be used in a real voting system. It is fairly simple to implement and could be done using existing standards and free, open source software with a minimal amount of additional GUI code, but it would be fairly difficult to have a multi-million dollar business around this and therefore tough to bribe the required officials into mandating an insecure and fundamentally flawed system.

Maybe next week, I'll implement this whole thing in under 1,000 lines of Ruby.

Wednesday, April 2, 2008

Oh this is too good

I thought I had found a hilariously nonsensical argument for proof of the existence of God in my last post (which basically boiled down to "blah blah blah, because I say so, blah blah, blah blah"), but this one takes the cake.

I originally found the site when it was linked off of some Slashdot story. The link was to a comparison of the Open Source software movement to Islamic terrorism. Oh, it gets better.

The first flag was that the site is called "The Objective Observer." Anything that needs to remind you in its URL that it is objective cannot be headed anywhere good. The about page contains the string "The Objective Observer" no fewer than 24 times. I must note that every single one of these is a link back to the main page.

So back to the matter at hand. The article itself is a masterpiece of non sequitur, jumping from one irrelevant rambling to the next without a hint of justification or transition in sight. I have taken the time to sum up the article in a few sentences:

The Greeks had some vague idea about something like DNA, which caused them to come up with their gods which later became the Christian god. There were three Fates and something else called Fortune, and there are four base pairs in DNA, so therefore the Greeks had a magical insight about life. There were twelve titans and twelve "immortal Olympians," and since twelve is divisible by four, there is a genetic basis for the Greek gods.

DNA was created by "aliens, time travelers, divine intervention, psychic abilities and 'primordial link'." Since some animals have a faint ability to sense magnetism [which appears to actually be the case, to my surprise], this is true. Humans lost their primordial link because we got too intellectual.

"Pi is the unsolvable equation for life because pi represents a perfect circle." Greek mythology and chemistry can prove the value of pi.

God is highly evolved life.

The existence of DNA proves the possibility of Maxwell's Demon.

Scientists have unduly ignored creationist stories. Since it is easier to say "god did it," than it is to understand cosmic background radiation, Occam's Razor says that the creationists must be right.

What if aliens are driving our evolution as a means of intergalactic warfare?

DNA is like a really advanced computer [again, this is actually the case].

Humans will become advanced enough to create life, but doing so would require all the energy in the universe. DNA. Pi.

God exists. QED.



If you have a lot of time to kill, take a look at his (or her) discussion of pi, in which "The Lion King," the fact that it is impossible to draw a perfect circle, and the fact that pi appears in a formula involving gravity prove that pi is cosmically tied to life and something about god.

Be sure to take it in limited doses and not before any test or assignment which requires logical thought.

Monday, January 7, 2008

Why do I feed the trolls?

So this is a little something I wrote in response to a decently popular video purporting to "prove the existence of God."

*Sigh* Since this video seems popular, I guess I'll bite.

Although I appreciate that you are going about it in a logical fashion, I suggest
that you brush up a bit on your first-order logic.

At a very basic level, the bulk of your video is essentially saying, "For all
theories of cosmology, all are false except for my notion of Christian
creation." You then proceed to (attempt) to disprove Big Bang theory and
conclude that Christian creation is the only remaining option. In order for this
approach to work, you would exhaustively need to disprove all other theories of
cosmology, a difficult task since there are a countably infinite number of them.

I will not contest your point about time starting at a definite point, though I
imagine even that could be refuted (related to the infinite count of numbers
between 0 and 1, or any two real numbers for that matter).

One minor point, which does not necessarily weaken your argument: the "matter cannot
be created or destroyed" thing is not strictly true unless one includes energy
in the definition of matter. This is because matter and energy are different
manifestations of the same phenomenon, and can be converted into each other,
but I digress.

So we then come to about minute 3 of your video. At some point we had "we don't
know" and then later we had a whole bunch of stuff without any
explanation. Current cosmology does not have an answer for how the universe
started, but we do know that within a few picoseconds of the origin of the
universe, everything behaved according to our present understanding of physics
and relativity (See COBE and cosmic background radiation).

Now, the first logical mistake you make is assuming that "we don't know"
means that we won't know or can't know. A couple thousand years ago we didn't
know how lightning worked and attributed that to Thor. Given our present
understanding of electricity, we know exactly how it works and can even
reproduce the same phenomenon in labs or high schools. Back then, lightning
might have seemed just as mythical as existence itself does now. Maybe in a couple hundred or
thousand years, high school students will be creating miniature big bangs to
make their friends' hair stand up.

I'll agree that there was a great deal of energy/mass involved in whatever
created the universe. It created approximately 10^80 atoms plus however many
have been converted to energy over time (probably less than an order of
magnitude).

The problem is that you implicitly assume that there was something to create it
which had to have a greater or equal amount of energy to start out with. It is
commonly believed that the current laws of physics did not apply in the first
few instants of the universe. This seems like a cop-out, but remember the
dichotomy between classical mechanics and quantum mechanics. There is a set of
Newtonian laws governing all physical objects until you look at things that are
small enough, at which point things cease to have discrete positions. Given that
fact, it is a small leap to say that there were different rules for the origins
of the universe in which matter generates more matter. One atom appears and,
according to this new set of rules, starts multiplying. It sounds weird, but no
less weird than probability distributions involving complex numbers (as in
quantum mechanics).

So there is no need for an immensely powerful entity to impart a portion of its
energy into the fledgling universe.

As for "extremely intelligent," there is no need at all for that. Your argument
for it being intelligent is that "it created enough stuff to ..." I fail to see
how "a process creating a certain volume of material" is necessarily
intelligent. A fire burning down a tree creates enough stuff to fertilize the
ground around it to sustain life for new plants. Is the fire then "intelligent?"
The sun produces enough energy to sustain life on Earth. Is the sun
"intelligent?"

I think part of the problem is a lack of understanding of how elements are
created. Assuming one starts with only Hydrogen (more likely it was even more
basic than that, but this is not a lesson in subatomic particles), one can get
all elements currently in existence. In brief, normal fusion within stars
produces successively heavier elements until about iron and nickel, which have
the highest binding energy per nucleon (so no more energy can be extracted from
fusion). This process is called Nucleosynthesis (wikipedia it) and it is the
reason why the overwhelming majority of the universe is made up of hydrogen
(about 73% by mass) followed by helium (24% by mass).

So all the creator has to do is make a crapload of hydrogen and kick back as it
fuses into helium, lithium, and so on.

Your argument was "(awesomely powerful) AND (amazingly intelligent) implies
(Christan God)." I have shown that the origin of the universe need not be
awesomely powerful and certainly not intelligent, so it is incorrect to conclude
that this is a correct proof of the existence of a Christian god.

On a side-note, I feel like there could be something in the Bible about it being
bad to demand proof of God's existence before worshipping him, but I could be
wrong; I never read the thing.