Saturday, June 7, 2008

Really Really Simple Syndication

From the Everything Is Miscellaneous and Web That Wasn't talks, it seems like what we really need is some way for computers to figure out what information we want and give it to us. There is a huge amount of information encoded in blogs (there was nearly one blog post for every word in a Bush speech about immigration), but so far there is no good way of extracting much of it. Sure, we can google for "blogs about Bush's immigration speech," but even that would likely turn up a bunch of blogs about random things, a bunch of pages about unrelated immigration topics, and so on.

The first step would be to be able to handle something like "tell me all of the instances in which Bush has changed his position on immigration," which would look at all the blogs that referenced the speech and find things relating to a change in the stated position. This would probably require natural-language understanding of queries, but even that might not be necessary, as the system could simply relate the words in the query to the words in the blogs.
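As a very rough sketch of that word-matching idea (the posts and query below are invented, and real blog posts would need much better tokenization than splitting on spaces), something like TF-IDF weighting plus cosine similarity can already rank which posts are even about the speech:

```python
import math
from collections import Counter

def tokenize(text):
    return [t for t in (w.strip(".,!?\"'").lower() for w in text.split()) if t]

def build_idf(posts):
    # how many posts each word shows up in
    df = Counter()
    for post in posts:
        df.update(set(tokenize(post)))
    n = len(posts)
    # +1 smoothing so words that appear everywhere still count a little
    return {term: math.log((n + 1) / count) for term, count in df.items()}

def vectorize(text, idf):
    tf = Counter(tokenize(text))
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented stand-ins for the thousand blog posts about the speech.
posts = [
    "Bush reversed his earlier position on immigration in tonight's speech",
    "My recipe for immigration-themed cupcakes",
    "The president's stance on guest workers has not changed since 2006",
]
idf = build_idf(posts)
query = vectorize("instances in which Bush has changed his position on immigration", idf)
for post in sorted(posts, key=lambda p: cosine(query, vectorize(p, idf)), reverse=True):
    print(round(cosine(query, vectorize(post, idf)), 2), post)
```

The post that actually talks about a change in Bush's position floats to the top without the system "understanding" anything, which is the point: word overlap gets you surprisingly far before you need real natural-language understanding.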

The next thing needed would be for it to be anonymous (or at least have that option) and distributed. Aside from privacy concerns, there could be massive scalability problems, a la Twitter. There is the concept of a distributed hash table, but that only works for exact matches. Wikipedia to the rescue, with this (PDF) bit of magic from Berkeley.
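To see why a plain DHT is not enough on its own, here is a toy consistent-hashing ring (the node names and keys are made up): two queries that mean the same thing hash to completely different points, so you only ever get back what was stored under the exact same key.

```python
import bisect
import hashlib

def h(value):
    # hash a string to a point on the ring
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class ToyDHT:
    """Each key is owned by the first node clockwise from its hash."""

    def __init__(self, nodes):
        self.ring = sorted((h(node), node) for node in nodes)

    def node_for(self, key):
        points = [point for point, _ in self.ring]
        index = bisect.bisect(points, h(key)) % len(self.ring)
        return self.ring[index][1]

dht = ToyDHT(["node-a", "node-b", "node-c"])
print(dht.node_for("bush immigration speech"))
print(dht.node_for("speech on immigration by bush"))  # almost certainly a different node
```

That exact-match limitation is exactly what the Berkeley paper's trickery is needed to get around.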

Personalized feeds

However, this would only be the beginning. Once you can respond to general natural-language queries like that, you could build up a list of what someone is interested in. RSS feeds are a fascinating concept and let people get a feed of news customized to their tastes, but they assume that everything at a given address will be interesting to me.

If you go to the RSS page on the New York Times, you will see that they provide feeds for "Business," "Arts," "Automobiles," and so on. Going down further, there are feeds for just "Media & Advertising" and "World Business" under Business, and for "Design" and "Music" under Arts, but nothing under Automobiles. This is the kind of problem David Weinberger was talking about: what if all "Arts" stories are pretty much the same to me, but I want to differentiate between "Foreign Cars" and "Domestic Cars," or even "Sports Cars" and "Trucks"? Maybe I don't care about red cars at all, so I never want to see a story about red cars in my feed.

The problem with RSS is that, even though it is a huge step forward for keeping track of many different sources of news at once, someone else has to decide how to split up the feeds. Most places give you some choice of what you might want; the current system at the New York Times is much better than one giant "this is everything" feed, but it still has a ways to go. This could be done without rewriting our RSS clients by letting the user of a site set up a custom RSS URL from which to pull updates, but that would place a huge burden on the content providers and would not scale at all.
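For what it's worth, the custom-URL idea is not much code on the provider's side. The sketch below is a hypothetical filtered-feed endpoint (the /feed route, the exclude parameter, and the upstream URL are all invented for illustration, and it assumes Flask and feedparser are installed); it also hints at why this doesn't scale, since every site would have to run and tune something like it for every reader.

```python
from xml.sax.saxutils import escape

import feedparser
from flask import Flask, Response, request

app = Flask(__name__)
# Example upstream feed; substitute whatever feed you actually read.
UPSTREAM = "https://rss.nytimes.com/services/xml/rss/nyt/Automobiles.xml"

@app.route("/feed")
def custom_feed():
    # e.g. /feed?exclude=red&exclude=trucks
    exclude = [w.lower() for w in request.args.getlist("exclude")]
    parsed = feedparser.parse(UPSTREAM)
    items = []
    for entry in parsed.entries:
        text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
        if any(word in text for word in exclude):
            continue  # the reader never sees stories about, say, red cars
        items.append(
            "<item><title>%s</title><link>%s</link></item>"
            % (escape(entry.get("title", "")), escape(entry.get("link", "")))
        )
    rss = (
        '<?xml version="1.0"?><rss version="2.0"><channel>'
        "<title>Filtered feed</title><link>%s</link>"
        "<description>Custom feed</description>%s</channel></rss>"
        % (UPSTREAM, "".join(items))
    )
    return Response(rss, mimetype="application/rss+xml")

if __name__ == "__main__":
    app.run()
```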

The solution

So after we have a good way for the user to tell the system what to get, we would want a way for the system to learn what the user likes, possibly using a Netflix-like recommendation system, and automatically pull down stories that match those tastes.
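A minimal sketch of what I mean by "Netflix-like," assuming the system keeps simple per-user ratings of stories (all of the users, stories, and ratings below are invented): find the readers whose tastes look most like yours and borrow the stories they rated highly that you haven't seen yet.

```python
import math

# Made-up ratings, 1 (hated it) to 5 (loved it), for stories each person read.
ratings = {
    "alice": {"immigration-speech": 5, "sports-cars": 1, "design-week": 4},
    "bob":   {"immigration-speech": 4, "design-week": 5, "world-business": 4},
    "carol": {"sports-cars": 5, "trucks": 4, "red-cars": 3},
}

def similarity(a, b):
    # cosine similarity over the stories two readers have both rated
    shared = set(a) & set(b)
    dot = sum(a[s] * b[s] for s in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(user, k=2):
    mine = ratings[user]
    # the k readers whose tastes look most like this user's
    neighbours = sorted(
        ((similarity(mine, theirs), other) for other, theirs in ratings.items() if other != user),
        reverse=True,
    )[:k]
    scores = {}
    for sim, other in neighbours:
        for story, rating in ratings[other].items():
            if story not in mine:
                scores[story] = scores.get(story, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # bob's tastes match alice's, so his "world-business" pick comes first
```

Netflix obviously does something far more sophisticated, but the shape of the idea is the same: the system never needs to be told why you like something, only that people like you liked it.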

If President Bush gives a new speech about policy for the Internets and 1,000 people blog about it, our little system should automatically sift through all of the posts, figure out which parts of the blogs would likely interest you, based on what you have told it you like in the past, and present you with some sort of reasonable compilation of the information, all without any user interaction. This would really unlock the potential of the Internet. Get hacking.
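If you want somewhere to start, the very last step is almost embarrassingly small once the hard parts above exist. Here is a hypothetical interest profile (the topic weights and posts are invented, standing in for whatever the system has actually learned) being used to build the digest:

```python
# Hypothetical per-user weights learned from past feedback, plus a few
# invented posts standing in for the day's thousand blog entries.
interest = {"immigration": 2.0, "policy": 1.5, "internets": 1.0, "nascar": -2.0}

posts = [
    "Bush outlines a new policy for the internets in tonight's speech",
    "Nascar highlights from the weekend",
    "Immigration policy shift draws a thousand blog responses",
]

def score(post):
    words = set(post.lower().split())
    return sum(weight for topic, weight in interest.items() if topic in words)

digest = sorted(posts, key=score, reverse=True)[:2]
print(digest)  # the nascar post never makes the cut
```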