Tuesday, December 30, 2008

Automatically Generating Blog Post Labels

I recently converted to the new layout provided by Blogger. As part of the face-lift process, I thought it would be fun to label my posts. You can find all the labels I came up with on the right sidebar.

In reality, these labels represent clusters of blog posts that are similar to each other in some way. Posts can belong to multiple labels or to none. Some labels might be very similar (so many posts belong to both), and some might even be nested (although that doesn't happen on this blog).

Furthermore, these labels are very view-dependent. For example, while I have separate labels for computer science and machine learning (which is primarily a computer science sub-area), I have one big label for science and technology in general. But given my particular interest in life extension, I decided that could be kept separate from the rest.

I started wondering how one might automatically cluster my blog posts. It's possible that conventional clustering techniques would yield interesting results, but I find that unlikely. They might work OK on very structured, high-volume blogs like Overcoming Bias. But conventional techniques ignore the fact that these clusters are view-dependent, so for most blogs we probably need to leverage some amount of background knowledge (and possibly the network structure). In addition, we need a clustering model that can handle the fact that some posts belong to multiple labels and some belong to none. There is also a time-dependency factor that might play a big role.
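
To make the "multiple labels or none" requirement concrete, here is a minimal sketch in Python. It is not a real clustering algorithm and not something I actually ran on this blog: it just compares each post's bag of words against a hand-written list of seed words per label (a crude stand-in for background knowledge) and keeps every label whose cosine similarity clears a threshold. The function names, the seed words, and the 0.2 threshold are all made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, strip surrounding punctuation, and count words."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_labels(posts, label_seeds, threshold=0.2):
    """Give each post every label whose seed words it resembles.

    posts:       dict mapping a post title to its text
    label_seeds: dict mapping a label to a list of seed words
                 (hypothetical background knowledge, chosen by hand)
    threshold:   arbitrary cutoff; a post below it for every label
                 ends up with no labels at all
    """
    seed_vectors = {label: Counter(words) for label, words in label_seeds.items()}
    assignments = {}
    for title, text in posts.items():
        vec = bag_of_words(text)
        assignments[title] = sorted(
            label
            for label, seed in seed_vectors.items()
            if cosine(vec, seed) >= threshold
        )
    return assignments
```

Because each label is tested independently, a post about kernels and classifiers can land in both machine learning and computer science, while a "happy new year" post can land nowhere. The view-dependence lives entirely in the seed lists: a different blogger would write different seeds and get different labels from the same posts.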

Why is this interesting? First of all, I think it's compelling enough to be able to discover a person's view (or projection) of the global topic/discipline hierarchy. It might also help us discover more about ourselves and how we view the world (since the clusters discovered by any algorithm will inevitably differ from our own manually generated labels). From a technical standpoint, depending on the approach, tackling this problem might yield insight into designing new clustering techniques or utilizing the background information of the internet.

Like most ideas, this one probably won't lead to any interesting results. But it's fun to think about. In case anyone is interested, I harvested my blog posts (with labels and all), and they're available here.


Parisa said...

Cool idea. I need to do this for my blog too.

Jess said...

Hey Yisong, I just looked up your blog again and thought I'd say hi. Looks like you're doing well. :)

Jess S (of IMSA fame)