Saturday, June 7, 2008

Really Really Simple Syndication

From the "Everything is Miscellaneous" and "The Web That Wasn't" talks, it seems like what we really need is some way for computers to figure out what information we want and give it to us. There is a huge amount of info encoded in blogs (take the example of nearly one blog post for each word in a Bush speech about immigration), but so far, there is no good way of extracting much of it. Sure, we can google for "blogs about bush's immigration speech," but even that would likely turn up a bunch of blogs about random things, a bunch of pages about unrelated immigration topics, and so on.

The first step would be being able to ask something like "tell me all of the instances in which Bush has changed his position on immigration," which would look at all the blogs that referenced the speech and find the parts relating to a change in the stated position. This would probably require natural language understanding of queries, but even that might not be necessary, as the system could simply relate the words in the query to the words in the blogs.

The next thing needed would be for the system to be anonymous (or at least have that possibility) and distributed. Aside from privacy concerns, there could be massive scalability problems, a la Twitter. There is the concept of a distributed hash table, but that only works for exact matches. Wikipedia to the rescue, with this (PDF) bit of magic from Berkeley.
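
To make the exact-match limitation concrete, here is a minimal sketch in Python (the node names and queries are made up): a DHT routes each exact key to one node by hashing it, so two phrasings of the same question land in completely different places.

import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # made-up peer names

def responsible_node(key):
    # A DHT routes each exact key to one node by hashing the key.
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# The same exact key always lands on the same node...
print(responsible_node("bush immigration speech"))
# ...but a near-miss phrasing hashes somewhere unrelated, which is why
# keyword or semantic queries don't map onto a plain DHT lookup.
print(responsible_node("bush's speech on immigration"))

Real DHTs like Chord or Kademlia use consistent hashing rather than a simple modulus, but the exact-match limitation is the same, which is why fuzzy or semantic queries need something fancier.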

Personalized feeds



However, this would only be the beginning. Once you can respond to general natural-language queries like that, you could build up a list of what someone is interested in. RSS feeds are a fascinating concept and allow people to get a feed of news customized to their tastes, but they assume that everything at a given address will be interesting to me.

If you go to the RSS page on the New York Times, you will see that they provide feeds for "Business," "Arts," "Automobiles," and so on. Going down further, there are feeds for just "Media & Advertising" and "World Business" under Business, and "Design" and "Music" under Arts, but nothing under Automobiles. This is the kind of problem that David Weinberger was talking about: what if all "Arts" are pretty much the same for me, but I want to differentiate between "Foreign Cars" and "Domestic Cars," or even "Sports Cars" and "Trucks"? Maybe I don't care about red cars at all, so I don't ever want to see a story about red cars in my feed.

The problem with RSS is that, even though it is a huge step forward for allowing us to keep track of many different sources of news at once, someone else has to decide how to split up the feeds. Most places give you some options for what you might want; the current system at the New York Times is much better than having one giant "this is everything" feed, but it still has a ways to go. Finer-grained feeds could be offered without rewriting our RSS clients by letting each user of a site set up a custom RSS URL to pull updates from, but that would place a huge burden on the content providers and would not scale at all.

The solution



So after we have a good way for the user to tell the system what to get, we would want a way for the system to learn what the user likes, possibly using a Netflix-like recommendation system, and automatically pull down stories that the user likes.
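
To give a rough idea of that learning step, here is a toy content-overlap scorer in Python (not Netflix's actual algorithm, and the example stories are made up): build a word-frequency profile from stories the user has liked, then rank incoming stories by how well they match it.

from collections import Counter

def build_profile(liked_stories):
    # Crude interest profile: how often each word shows up in liked stories.
    profile = Counter()
    for story in liked_stories:
        profile.update(story.lower().split())
    return profile

def score(story, profile):
    # Score a new story by how much of it overlaps with the profile.
    return sum(profile[word] for word in set(story.lower().split()))

liked = ["foreign sports cars reviewed", "new rules for foreign car imports"]
profile = build_profile(liked)

incoming = ["domestic truck sales rise", "foreign sports car sets speed record"]
for story in sorted(incoming, key=lambda s: score(s, profile), reverse=True):
    print(score(story, profile), story)

A real system would also fold in signals from other users (the collaborative-filtering part), but even this toy shows how the feed stops depending on someone else's category boundaries.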

If President Bush gives a new speech about policy for the Internets and 1,000 people blog about it, our little system should automatically sift through all of the posts, figure out which parts of the blogs would likely interest you, based on what you have told it you like in the past, and present you with some sort of reasonable compilation of the information, all without any user interaction. This would really unlock the potential of the Internet. Get hacking.

Thursday, May 29, 2008

Missing the point

While making my daily rounds of programming language-related blog posts, I came across a couple of items which caught my attention. I have been very interested in parallel and concurrent programming of late, especially in how to solve the big issues everyone seems worried about relating to code that is not easily parallelizable.

After following a couple of branches off of a Slashdot story, I saw a couple of posts which seemed to confuse a few of the issues surrounding parallel programming. This one in particular conflates a number of different problems.

Starting off, he correctly points out that:

Users don't care about parallel processing anymore than they care about how RAM works or what a hash table is. They care about getting their work done.

Assuming that he is not talking about "people writing programs" as users, he is absolutely correct. As long as something works well and fast enough, nobody cares.

But therein lies the problem: well and fast enough. The "well" part is fairly simple: if you are doing multiprocessing, it still has to work. That's pretty obvious, and while it can be challenging at times, there is no real controversy over that fact.

This leaves the "fast enough" part. The problem here is that since the dawn of time (which according to my computer is January 1, 1970), people have been able to count on future computers getting faster. Moore's law and all. Nowadays, computers get faster by adding more cores, but software is still written assuming that we will get this hardware speedup. The problem is that the hardware guys are tired of giving the software guys gigantic speed boosts without forcing the software guys to change their behavior at all. They are holding a workers' revolution, throwing off the shackles of the oppressors, and saying, "You want more speed? Write your programs more parallely!"

He and the guy here do mention the problem of playing music while surfing the web and checking email (which are often both in the same program, but whatever). On a single-core system, playing a song, unzipping a file, and running a browser would lead to some whole-system slowdown, and multi-core computers greatly helped with that problem; nobody is arguing otherwise. The problem is that this process-level parallelism only helps you until you have more cores than processes, which really isn't that far off. Think about how many programs you are running now. Aside from OS background services, which spend the vast majority of their time asleep, there are probably between 2 and 5.

Let's say that you are playing a movie, browsing a flashy website, talking to someone over VoIP, and compiling your latest project. That's four processes. You will see great improvements in responsiveness and speed all the way up to 4 cores, but after that, the extra cores will sit idle. If each of those programs is single-threaded, or its main CPU-intensive portion is single-threaded, then adding more cores won't help you at all.

This is the point that the author of the second article misses. There may be 446 threads on his system, but, as he observes, many of them are for drawing the GUI or doing some form of I/O. Drawing a GUI takes relatively little processor power compared with something like encoding a video (unless you are running Vista, har har), and for I/O threads, most of what they are doing looks like this:


while (!is_burnt_out(sun)) {
    /* Block, using no CPU, until the OS hands us some data... */
    data = wait_until_ive_gotten_something();
    /* ...then do a trivial amount of work with it. */
    copy(data, buffer);
}


In other words, not really doing anything with the CPU. This means that, although there are a great number of threads, only a few of them (Firefox does all of its rendering and JavaScript execution in a single thread for all open windows and tabs) will actually be using the processor. The "crisis" that everyone is talking about is when those single computation threads start to get overloaded. With so many things moving to the browser, what good is a 128-core CPU if all of my precious web apps run in Firefox's single thread? I have these 768 cores sitting around; wouldn't it be nice to use more than 2 of them for video encoding?

Just Make The Compilers Do It



One thing that the second article brings up is automatically-parallelizing compilers. I do think that there is something to be said for these, especially since it has been shown over and over, first with assembly and then with garbage collection, that compilers and runtimes will get smarter than programmers at doing things like emitting assembly or managing memory. I would not rule out the chance that a similar thing will happen with parallelization.

I do think that making parallelizing compilers will not be as "easy" as writing good optimizing compilers or good garbage collectors. The problem is that the compiler would need a broad enough view of a serial program to "figure out" how it can be broken up and which parts can run concurrently, in order to generate very concurrent code. This goes beyond inlining functions or unrolling loops, to figuring out when to spawn new threads and how and when to add locks to serial code to extract maximum performance. Far be it from me to say that we will "never" come up with something that smart, but I seriously doubt that we will be able to code exactly as before and have our compiler do all the magic for us.
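
To make concrete what such a compiler would have to discover on its own, here is the kind of transformation a programmer currently performs by hand. This is only a Python sketch with a made-up work function; the key point is that the human has asserted that the loop iterations are independent.

from multiprocessing import Pool

def crunch(n):
    # Stand-in for an expensive piece of work with no dependencies on the
    # other iterations.
    return sum(i * i for i in range(n))

inputs = [200000, 300000, 400000, 500000]

if __name__ == "__main__":
    # Serial version: one core, one iteration at a time.
    serial = [crunch(n) for n in inputs]

    # Hand-parallelized version: the programmer has decided the iterations
    # are independent, which is exactly what a parallelizing compiler would
    # have to prove for itself before making this transformation.
    with Pool() as pool:
        parallel = pool.map(crunch, inputs)

    assert serial == parallel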

Save Us, Oh Great One!



So what do we do? The answer is not yet clear. There are clearly a lot of problems with traditional multithreading with locks (a la pthreads and Java Threads), but nobody seems to agree on a clearly better way of doing things. I saw a cool video here about a lock-free hash table. The idea of using a finite state machine (FSM) for designing a concurrent program is fascinating, but I could see problems with data structures involving more dependent elements, like a balanced tree. Still, I think the approach gives good insight into one way forward.

A similar, but less... existing idea is that of COSA, a system by renowned and famous crackpot Louis Savain. He uses an interesting, but basically unimplemented graphical model for doing concurrent programming and talks about the current problems with concurrent code.

Now, judging from the tone of this article and the titles of some of his other posts (Encouraging Mediocrity at the Multicore Association, Half a Century of Crappy Computing), he seems to be 90% troll, 10% genius inventor. I took a brief look at his description of COSA, and it seems to have some properties similar to the (much more legitimate) lock-free hash table. The idea of modelling complex program behavior as a series of state machines or as a big graph is appealing, but a lot would need to be done to make this into a practical system.

As interesting as COSA looks, the author seems to be too much of a Cassandra for it to gain any appeal. I mean "Cassandra" in the "can see the future, but nobody believes him," sense, not the "beauty caused Apollo to grant her the gift of prophecy" part (but who knows, I've never seen the guy).

I see some evolution of parts of each of these ideas as being a potential solution for some of the multiprocessing problems. The guy who made the lock-free hash table was working at the level of individual CAS (compare-and-swap) operations, which would definitely need to change for a widely used language. Just as "goto" was too low-level and was replaced with "if", "for", and "map", some useful abstraction over CAS will come along and enable concurrent programming at a higher level of abstraction.
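
For reference, the low-level pattern being abstracted over looks roughly like the following sketch. Python has no hardware compare-and-swap, so the CAS here is simulated with a lock purely to show the interface and the retry loop; in a real lock-free structure it would be a single atomic instruction.

import threading

class Cell:
    # A single slot supporting compare-and-swap. In a real lock-free data
    # structure this would be one hardware CAS instruction, not a lock; the
    # lock is only here to give the sketch correct semantics.
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(cell):
    # The classic CAS retry loop: compute a new value from the old one and
    # retry if another thread slipped in between the read and the swap.
    while True:
        old = cell.read()
        if cell.compare_and_swap(old, old + 1):
            return

counter = Cell(0)
threads = [threading.Thread(target=lambda: [increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter.read() == 4000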

Monday, April 21, 2008

On Voting Systems and Hash Trees

So after reading this article, I started thinking a bit about the implementation side of some of our voting problems.

First, let's think about requirements. A good voting system should be:

  1. Private, in that nobody can force you to reveal who you voted for. This is important because it prevents someone from threatening a voter for voting for a certain candidate. Also, if someone publicly states that they support one candidate but then votes for another, it could be damaging to their reputation to have it exposed that they voted differently. Who someone votes for is their private business.

  2. Auditable. There must be some way to prove that the votes cast are the same ones counted for deciding who is president (or Governor, or senator, etc.).


  3. Verifiable for individuals. If I suspect that my vote was altered, I should be able to check that the vote that I made has not been modified before being counted.

The traditional way of using paper ballots works to a point, but it is possible to modify paper or create fake ballots, and of course there was the infamous "dangling chad" fiasco.

Thus, I have spent some time brainstorming a cryptographically secure system using public keys and hash trees.

There would be a master key pair for the whole election, owned by the federal body in charge of handling elections. Each state would have its own key pair, as would each polling station and each machine. All of the state public keys would be signed by the federal key, each state would sign the keys of all of the polling stations, and the polling stations would sign the keys of each machine.

When a person then votes, a hash of their vote (optionally with some random value) is signed by the voting machine's key and by the polling station's key. Before the user votes at a station, they can verify that the state voting authority actually signed the key of the polling station, and then verify that the polling station signed the key of the voting machine. Assuming the federal key could be accurately transmitted to everyone, which could be done over a number of very public channels, the user could verify all of the keys down to that of the individual machine at which they are voting.
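
Here is a rough sketch of that chain of signatures, assuming the third-party pyca/cryptography package and Ed25519 keys (the post does not pick a signature algorithm, and the names are only illustrative):

from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

def raw(public_key):
    # Serialize a public key so it can be signed like any other message.
    return public_key.public_bytes(Encoding.Raw, PublicFormat.Raw)

# One key pair per level of the hierarchy.
federal = ed25519.Ed25519PrivateKey.generate()
state = ed25519.Ed25519PrivateKey.generate()
station = ed25519.Ed25519PrivateKey.generate()
machine = ed25519.Ed25519PrivateKey.generate()

# Each level signs the public key of the level below it.
sig_state = federal.sign(raw(state.public_key()))
sig_station = state.sign(raw(station.public_key()))
sig_machine = station.sign(raw(machine.public_key()))

# A voter who trusts only the federal public key can walk the chain down to
# the machine in front of them; verify() raises if any link is forged.
federal.public_key().verify(sig_state, raw(state.public_key()))
state.public_key().verify(sig_station, raw(station.public_key()))
station.public_key().verify(sig_machine, raw(machine.public_key()))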

The purpose of this is so that if the user suspects that their vote was not counted or was manipulated, they can go to the polling station or state court and prove that a polling station authorized by the state registered their vote. This would prevent people from being lured to fake polling stations that simply throw away their votes.

Also, to ensure that the state received the vote, as part of the process the user's vote could be sent (securely) to a computer owned by the state voting group and a signature sent back to the user. This way, the user would have proof that the state received the vote that the user intended.

The next problem is allowing users to force the government to prove that the vote used for counting is the same as the vote that the user made. This could be done with a hash tree. Once each state had collected all of the votes, it would assemble a hash tree using the earlier hashes of votes as the leaves. The state would then sign and publish the root hash. If a user wanted to verify that the government actually used his or her vote in the final count, the user would ask for a validation of their hash, which would require the government to transmit only a logarithmic number of hashes to the user. If these hashes matched the public root hash, then the user would know that his or her vote was accounted for.
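
Here is a minimal sketch of that hash tree in Python (using SHA-256 for the nodes, though any hash would do): the state publishes only the root, and a voter checking one leaf needs just the sibling hashes along the path up to the root.

import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_levels(leaf_hashes):
    # Build the tree bottom-up; levels[0] is the leaves, levels[-1] the root.
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev, nxt = levels[-1], []
        for i in range(0, len(prev), 2):
            left = prev[i]
            right = prev[i + 1] if i + 1 < len(prev) else prev[i]  # lone node pairs with itself
            nxt.append(h(left + right))
        levels.append(nxt)
    return levels

def audit_path(levels, index):
    # The sibling hashes needed to recompute the root from one leaf:
    # logarithmic in the number of votes.
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # lone node at the end of an odd level
        path.append((level[sibling], sibling > index))
        index //= 2
    return path

def verify(leaf_hash, path, root):
    node = leaf_hash
    for sibling, sibling_is_right in path:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root

# Example: five votes, prove that vote 3 is included under the published root.
votes = [h(("ballot %d" % i).encode()) for i in range(5)]
levels = build_levels(votes)
root = levels[-1][0]
proof = audit_path(levels, 3)
assert verify(votes[3], proof, root)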

When the user goes to a voting station, they would take with them a copy of the keys used by both the federal government and the state. The user would leave with a secret random number which, with their vote, could be used to reconstruct the user's vote hash as well as signatures from the voting machine, polling station, and the state. At home, the user could manually verify that their vote hash was correctly calculated, as well as verify the signatures of each of the authorities who signed the user's vote hash.

The only remaining issue is the random number. Since there are so few choices in an election, without any padding, most hashes would be the same. If the options were "I vote for McCain" and "I vote for Obama", then using the SHA1 algorithm, there would be only two different hashes:

"I vote for McCain"
2647414cd4c7769b05fcb68c9b6dde321fdaa49c

"I vote for Obama"
65d5d0d9855878e256a75c8c7fce688b4877c193


Someone could then simply look and see which hash was given and determine who the user voted for. The usual fix when hashing messages with so little variety is to mix in some random data (which is stored for later validation). A simple example is to include the current time:

"I vote for Obama at Mon Apr 21 17:52:07 EDT 2008"
e194810aa658f103e5359a985fd35e722eb3e857

"I vote for Obama at Mon Apr 21 17:52:49 EDT 2008"
bcf9b0934b562621b8a861ce8eb427c2fa99da16


This could be supplemented with additional data to reduce the likelihood that two people voting for the same candidate would wind up with the same hash, since it is easy to imagine that two people in a country of 300 million could vote for the same candidate in the same second.
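
In code, the salted vote hash described above is only a few lines (a sketch; the exact formatting of the ballot string is made up):

import hashlib
import secrets
import time

nonce = secrets.token_hex(16)  # random value handed to (and kept by) the voter
ballot = "I vote for Obama"
stamped = "%s at %s nonce=%s" % (ballot, time.ctime(), nonce)
vote_hash = hashlib.sha1(stamped.encode()).hexdigest()
print(vote_hash)  # this is the leaf that would go into the state's hash tree

To verify later, the voter recomputes the hash from the ballot text, the timestamp, and the nonce, and checks it against the signed leaf in the published tree.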

There is still the problem of hiding the vote, though. The user needs that data at the end of the day to be able to verify that the hash correctly identifies the user's vote, but if an adversary had the random data and the hash, they could figure out the user's vote as easily as before. I don't have a good solution for this yet; one possible option is giving the user a hash of the random data and mailing the actual value to the user separately, but that is cumbersome.

Anyway, I doubt that this will actually be used in a real voting system. It is fairly simple to implement and could be done using existing standards and free, open source software with a minimal amount of additional GUI code, but it would be fairly difficult to build a multi-million-dollar business around it, and therefore tough to bribe the required officials into mandating an insecure and fundamentally flawed system.

Maybe next week, I'll implement this whole thing in under 1,000 lines of Ruby.

Wednesday, April 2, 2008

Oh this is too good

I thought I had found a hilariously nonsensical argument for the existence of God in my last post (which basically boiled down to "blah blah blah, because I say so, blah blah, blah blah"), but this one takes the cake.

I originally found the site when it was linked from some Slashdot story. The link was to a comparison of the Open Source software movement to Islamic terrorism. Oh, it gets better.

The first red flag was that the site is called "The Objective Observer." Anything that needs to remind you in its URL that it is objective cannot be headed anywhere good. The about page contains the string "The Objective Observer" no fewer than 24 times. I must note that every single one of these is a link back to the main page.

So back to the matter at hand. The article itself is a masterpiece of non sequitur, jumping from one irrelevant rambling to the next without a hint of justification or transition in sight. I have taken the time to sum up the article in a few sentences:

The Greeks had some vague idea about something like DNA, which caused them to come up with their gods which later became the Christian god. There were three Fates and something else called Fortune, and there are four base pairs in DNA, so therefore the Greeks had a magical insight about life. There were twelve titans and twelve "immortal Olympians," and since twelve is divisible by four, there is a genetic basis for the Greek gods.

DNA was created by "aliens, time travelers, divine intervention, psychic abilities and 'primordial link'." Since some animals have a faint ability to sense magnetism [which appears to actually be the case, to my surprise], this is true. Humans lost their primordial link because we got too intellectual.

"Pi is the unsolvable equation for life because pi represents a perfect circle." Greek mythology and chemistry can prove the value of pi.

God is highly evolved life.

The existence of DNA proves the possibility of Maxwell's Demon.

Scientists have unduly ignored creationist stories. Since it is easier to say "god did it" than it is to understand cosmic background radiation, Occam's Razor says that the creationists must be right.

What if aliens are driving our evolution as a means of intergalactic warfare?

DNA is like a really advanced computer [again, this is actually the case].

Humans will become advanced enough to create life, but doing so would require all the energy in the universe. DNA. Pi.

God exists. QED.



If you have a lot of time to kill, take a look at his (or her) discussion of pi, in which "The Lion King," the fact that it is impossible to draw a perfect circle, and the fact that pi appears in a formula involving gravity prove that pi is cosmically tied to life and something about god.

Be sure to take it in limited doses and not before any test or assignment which requires logical thought.

Monday, January 7, 2008

Why do I feed the trolls?

So this is a little something I wrote in response to a decently popular video purporting to "prove the existence of God."

*Sigh* Since this video seems popular, I guess I'll bite.

Although I appreciate that you are going about it in a logical fashion, I suggest
that you brush up a bit on your first-order logic.

At a very basic level, the bulk of your video is essentially saying, "For all
theories of cosmology, all are false except for my notion of Christian
creation." You then proceed to (attempt) to disprove Big Bang theory and
conclude that Christian creation is the only remaining option. In order for this
approach to work, you would exhaustively need to disprove all other theories of
cosmology, a difficult task since there are a countably infinite number of them.

I will not contest your point about time starting at a definite point, though I
imagine even that could be refuted (related to the infinite count of numbers
between 0 and 1, or any two real numbers for that matter).

One minor point, which does not necessarily weaken your argument: the "matter
cannot be created or destroyed" thing is not strictly true unless one includes
energy in the definition of matter. This is because matter and energy are
different forms of the same phenomenon and can be converted into each other,
but I digress.

So we then come to about minute 3 of your video. At some point we had "we don't
know" and then later we had a whole bunch of stuff without any
explanation. Current cosmology does not have an answer for how the universe
started, but we do know that within a few picoseconds of the origin of the
universe, everything behaved according to our present understanding of physics
and relativity (See COBE and cosmic background radiation).

Now, the first logical mistake you make is assuming that because we don't know
something now, we won't or can't ever know it. A couple thousand years ago we
didn't know how lightning worked and attributed it to Thor. Given our present
understanding of electricity, we know exactly how it works and can even
reproduce the same phenomenon in labs or high schools. Back then, lightning
might have seemed just as mythical as the universe's existence does now. Maybe
in a couple hundred or thousand years, high school students will be creating
miniature big bangs to make their friends' hair stand up.

I'll agree that there was a great deal of energy/mass involved in whatever
created the universe. It created approximately 10^80 atoms plus however many
have been converted to energy over time (probably less than an order of
magnitude).

The problem is that you implicitly assume that there was something to create it
which had to have a greater or equal amount of energy to start out with. It is
commonly believed that the current laws of physics did not apply in the first
few instants of the universe. This seems like a cop-out, but remember the
dichotomy between classical mechanics and quantum mechanics. There is a set of
Newtonian laws governing all physical objects until you look at things that are
small enough, at which point things cease to have discrete positions. Given that
fact, it is a small leap to say that there were different rules for the origins
of the universe in which matter generates more matter. One atom appears and,
according to this new set of rules, starts multiplying. It sounds weird, but no
less weird than probability distributions involving complex numbers (as in
quantum mechanics).

So there is no need for an immensely powerful entity to impart a portion of its
energy into the fledgling universe.

As for "extremely intelligent," there is no need at all for that. Your argument
for it being intelligent is that "it created enough stuff to ..." I fail to see
how "a process creating a certain volume of material" is necessarily
intelligent. A fire burning down a tree creates enough stuff to fertilize the
ground around it to sustain life for new plants. Is the fire then "intelligent?"
The sun produces enough energy to sustain life on Earth. Is the sun
"intelligent?"

I think part of the problem is a lack of understanding of how elements are
created. Assuming one starts with only Hydrogen (more likely it was even more
basic than that, but this is not a lesson in subatomic particles), one can get
all elements currently in existence. In brief, normal fusion within stars
produces successively heavier elements until about iron and nickel, which have
the highest binding energy per nucleon (so no more energy can be extracted from
fusion). This process is called nucleosynthesis (Wikipedia it), and since stars
have only processed a small fraction of the original material so far, the
universe is still overwhelmingly hydrogen (about 73% by mass) followed by
helium (24% by mass).

So all the creator has to do is make a crapload of hydrogen and kick back as it
fuses into helium, lithium, and so on.

Your argument was "(awesomely powerful) AND (amazingly intelligent) implies
(Christan God)." I have shown that the origin of the universe need not be
awesomely powerful and certainly not intelligent, so it is incorrect to conclude
that this is a correct proof of the existence of a Christan god.

On a side-note, I feel like there could be something in the Bible about it being
bad to demand proof of God's existence before worshipping him, but I could be
wrong; I never read the thing.

Saturday, November 17, 2007

Global Media Keyboard Shortcuts for Bongo

I am currently on hour 9 of my stay in the Electrical Engineering lab of Barus & Holley, and I am going a little insane. Luckily, I figured out a neat little trick that will make my media playing under Bongo that much more pleasant.

In Ubuntu, at least for me, the media keys on my Thinkpad T60 work great with Rhythmbox without any fancy configuration. This would be great if I wanted to use that particular player, but I have bigger plans. I want Emacs to be my jukebox as well.

Unfortunately, those lovely keyboard shortcuts do not work with Bongo by default. Luckily, with the help of KeyTouch, I was able to get it working. KeyTouch is a fairly straightforward program, and there is plenty of additional info available here.

What I learned is that you have to disable the shortcuts you want to manage with KeyTouch in Ubuntu's System > Preferences > Keyboard Shortcuts since they will take precedence over anything you define in KeyTouch.

In KeyTouch, you will have a list of keys and their actions. Let's say we want to map the play/pause command for Bongo to the "Fn-Down" combination (this happens to be the play/pause key on my laptop). We can give KeyTouch a program to execute, but which one should it run? We already have a running Emacs instance, so opening a new one would be just wrong. Whatever shall we do?

Enter EmacsClient. Its basic goal is to let you set it as your $EDITOR so that files open in an existing Emacs instance. One lovely thing about it is that it also allows invocation of arbitrary elisp expressions.

Once you have it set up, you can use the following to pause the currently playing Bongo file:


emacsclient -e '(bongo-pause/resume)'

It pauses or resumes playback (shocker, I know). The elisp functions bongo-next and bongo-previous allow you to jump forward or back through your current playlist.

Now I have one less reason to ever leave trusty ole Emacs behind. Stay tuned for when I switch all of my email over from Gmail to Emacs!

Friday, November 16, 2007

Many TeXnicalities

So it has been a while since my last post. Life has been busy.

Specifically, I spent most of last weekend dancing and all of this week working. Starting at about 3:00 PM on Friday, November 9th and lasting until about 8:00 PM on Sunday, I was doing some form of dance. It started with about an hour of Swing practice with Sarah for another showing of our Aladdin piece, followed by the actual performance of it at 5. After that I had an hour of free time before Ballroom practice. You see, there was a competition on Sunday, so every free moment I could find was spent meeting with partners.

After that, rather than take the smart route out and go to sleep, I made the long trek over to Machado for about three hours of... dancing.

Saturday was similar. I was meeting with ballroom partners up until it was time to prepare for the one and only SexPowerGod. While that may not have been purely dancing per se, it was most definitely tiring. After that, I had two hours to shower and change into something halfway respectable for the Brown Ballroom Competition, for which I had to leave at 5:30 AM. Being the largest single-day ballroom comp on the east coast, it went on for a while and took the last bits of energy out of me, until I could not even successfully point a video camera at a stationary dancing couple.

My exhausting weekend aside, this post is about the glories of LaTeX. I picked it up sometime last year when CS22 suggested that all homework be done with it, and I have become so hooked on it that I almost wrote a $ before a mathematical equation in class today.

For those of you who don't know it, LaTeX is what you would turn to if you feel that word processing should be a subset of programming or feel the need to use Emacs for every aspect of your life (I added music player to that list last night). Rather than pressing Ctrl-B to make text bold, you would write something like this:

\textbf{Bold text!}

The main point of this post is to gush over two related discoveries I recently made regarding LaTeX and Emacs. They are WhizzyTeX and Active-DVI.

WhizzyTeX is a minor mode for LaTeX editing in Emacs which opens an external viewer to mirror changes made to the source file in real time. So as you type, the dvi file gets updated, AUTOMAGICALLY! For plain text this is not that big of a deal, but for complicated mathematical expressions or manipulation of parameters to get a look right, this cuts down on the edit-compile-review loop which can suck time away from creating masterful works. It also has the benefit of real-time error checking beyond what AucTeX already provides, so it can catch mismatched braces or unrecognized control sequences as you (mis-)type them. It is kind of like what I imagine flymake to be like (I have never actually used flymake, so I don't know for sure). Finally, it has some fancy features which depend on Active-DVI (advi), like mouse editing of LaTeX documents, such as dragging a control to resize an hbox. Neat, but not often used.

Advi is like xdvi on steroids. It is a viewer as well as a document style for making snazzy presentations. Seriously, check out some of the stuff on the site. It can embed videos, launch external applications, and even embed those applications in the same window as the presentation, so you could potentially run Emacs within a dvi or even run advi within itself recursively! The possibilities are endless.

Unfortunately I have not had to make a presentation any time within the last couple years, so it is unlikely I will get to take advantage of all of that sweetness, but it is cool to know it's there.

My only gripe so far is that WhizzyTeX necessarily opens a new window for the preview. I have the screen real estate, but it would be nice not to have to leave Emacs for anything (soon I'll have email in there as well). I did some searching on the possibility of embedding an X11 application within an Emacs buffer (like how it displays pictures, but again on some fat 'roids). That way I could have advi sitting in an Emacs buffer, put that in a separate window from the source, and have all of that wonderful interaction. Unfortunately, I was unable to find anything suggesting that this is possible. If anyone knows anything about it, please let me know. My X11 hacking skills are insufficient to just do it myself, so that's out.

Anyway, I'm off to TA CS31. Get more sleep than I did this week!