Friday, January 08, 2010

A CERN for Information Retrieval?

One major roadblock that information retrieval researchers presently face is a dearth of suitable live experimental testbeds. It's becoming increasingly apparent that offline data collection techniques (e.g., acquiring human judgments) cannot scale with growing demand and are insufficient for answering many of the interesting research questions now emerging. After all, the ultimate goal is to design the best possible retrieval system for a particular search domain (e.g., medical or patent search), but asking humans to go through and label thousands or millions of documents for each domain seems impossibly inefficient.

The primary benefit of having live experimental information systems is the ability to try out different retrieval functions and subsequently analyze changes in user behavior (i.e., does the new retrieval function work better or worse?). Being able to validate ideas under live experimental conditions can yield valuable new insights and help advance the state of the art in information retrieval.

At present, there exists a smattering of digital libraries which host search services that allow for online experimentation. For example, here at Cornell computer science, we manage the search service for arXiv, with the goal of developing new methods for interactive online experimentation and learning. However, for most academic researchers in information retrieval, it is very difficult to gain access to such a system.

This raises the question of whether the research community could, or should, build and maintain a centralized information system as a communal research tool -- much like how high energy physicists collaborate to run experiments at CERN, or how networking researchers collaborate to maintain PlanetLab. Such a system needs to be centralized or consolidated in some way in order to draw in a sufficiently large user base -- otherwise there would be too little usage data to make the analysis interesting.

This idea has been proposed in the past, though it has never really picked up much steam. The biggest obstacle to developing such a system is, of course, the manpower required to get it off the ground. However, with the recent arrival of new platform technologies (e.g., Yahoo! BOSS), this barrier appears lower than ever.

Consider the following scenario:

1. Individual research groups use platforms such as BOSS to efficiently build their own retrieval functions and search servers. These servers have no direct users.

2. A centralized search portal has a large user base. Users are bucketed into different experimental conditions, and different research groups own different experiments. For each experiment, the portal directs queries to the appropriate research group's search server.

3. Research groups take turns running experiments and sharing the common user base. Some kind of cost-sharing scheme would probably be required as well.
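
To make the routing step in item 2 a bit more concrete, here is a minimal sketch of how a portal might bucket users and pick a research group's server. Everything in it is an assumption made for illustration: the experiment names, the example.org endpoints, and the choice of hashing the user ID are all hypothetical, and a real portal would also need to forward the query, log clicks, and handle server failures.

```python
import hashlib

# Hypothetical registry: one experiment per research group.
# The names and endpoints below are placeholders, not real services.
EXPERIMENTS = [
    {"name": "group_a_ranker", "endpoint": "https://group-a.example.org/search"},
    {"name": "group_b_ranker", "endpoint": "https://group-b.example.org/search"},
    {"name": "baseline", "endpoint": "https://portal.example.org/search"},
]


def assign_bucket(user_id: str, num_buckets: int) -> int:
    """Map a user ID to a stable bucket index in [0, num_buckets)."""
    # Hashing (rather than random choice) keeps a user in the same
    # experimental condition across visits.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def route_query(user_id: str, query: str) -> dict:
    """Pick the experiment that owns this user and describe where to send the query."""
    experiment = EXPERIMENTS[assign_bucket(user_id, len(EXPERIMENTS))]
    return {
        "experiment": experiment["name"],
        "endpoint": experiment["endpoint"],
        "query": query,
    }


if __name__ == "__main__":
    print(route_query("user-42", "higgs boson decay channels"))
```

Taking turns, as in item 3, could then amount to swapping entries in the registry between experimental periods, though how to schedule and pay for those turns is exactly the open question.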

The good news is that the above scenario doesn't sound impossibly difficult or time consuming. And of course, this scenario merely scratches the surface. After all, there is so much more we could be experimenting with beyond just the ranking function. But the more exotic our methods get, the more we need to be grounded in reality. That is why having such a centralized user-sharing experimental testbed seems so appealing.
