Tuesday, June 09, 2009

Learning Interactively from Partial Relative Feedback

In recent years, much of computer science research has focused on developing algorithms that operate in the presence of uncertainty. Indeed, from a low-level technical standpoint, statistical machine learning is essentially about designing models with strong predictive power and finding efficient algorithms for working with such models.

One such area is the so-called multi-armed bandit problem, which is particularly well suited to modeling real-world problems where learning must be done online (paying a higher penalty for exploring worse strategies) and feedback is partial (the learner never observes the outcomes of strategies it did not try). Example settings include clinical trials, advertising, adaptive routing over networks, and basically any scenario where one can only try out a limited number of strategies at a time.
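To make the setting concrete, here is a minimal sketch of the classical (absolute-feedback) bandit problem, using a simple epsilon-greedy learner over Bernoulli arms. The arm probabilities, horizon, and exploration rate are all illustrative assumptions on my part, not anything from the papers discussed below.

```python
import random

def epsilon_greedy(arm_means, horizon=10000, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy bandit: pull one arm per round, observe only
    that arm's (noisy, absolute) reward, and track empirical means."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    estimates = [0.0] * k
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:                # explore: pick a random arm
            arm = rng.randrange(k)
        else:                                     # exploit: best arm so far
            arm = max(range(k), key=lambda i: estimates[i])
        # partial feedback: we only see the reward of the arm we pulled
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

# e.g. three strategies with unknown success rates
print(epsilon_greedy([0.3, 0.5, 0.7]))
```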

A critical modeling choice in conventional bandit formulations is to assume the availability of absolute, or cardinal, feedback. In many (and I believe most) settings, such feedback is either unavailable or difficult to obtain. For example, when conducting a blind taste test, rather than asking people to rate the quality of two food items separately on an absolute scale, one would typically ask participants which product they prefer in a side-by-side relative comparison. Generally speaking, I believe it is often much easier to obtain relative, or ordinal, feedback than absolute feedback.

In collaboration with Thorsten Joachims, Josef Broder and Bobby Kleinberg, we've been working on a bandit formulation -- called the dueling bandits problem -- where feedback is relative rather than absolute. So rather than trying out one strategy and seeing how it works out, as in conventional bandit settings (which assume absolute feedback), we propose to compare a pair of strategies and obtain a (possibly noisy) observation about which strategy is superior (relative feedback). We have recently obtained results for both the discrete and continuous cases. The two papers will be presented later this month at COLT and ICML, respectively.
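The feedback model is easiest to see in code. The toy sketch below assumes a hypothetical pairwise win-probability matrix p_win and uses a naive explore-then-commit scheme; it only illustrates the noisy comparison oracle, not the actual algorithms from the COLT and ICML papers, which are much more careful about the regret incurred while exploring.

```python
import random

def duel(p_win, i, j, rng):
    """Noisy comparison oracle: True if strategy i beats strategy j this round.
    p_win[i][j] is the (unknown to the learner) probability that i beats j."""
    return rng.random() < p_win[i][j]

def naive_dueling_bandit(p_win, comparisons_per_pair=200, seed=0):
    """Toy explore-then-commit under relative feedback: duel every pair a fixed
    number of times, then commit to the strategy with the most pairwise wins."""
    rng = random.Random(seed)
    k = len(p_win)
    wins = [0] * k
    for i in range(k):
        for j in range(i + 1, k):
            for _ in range(comparisons_per_pair):
                if duel(p_win, i, j, rng):
                    wins[i] += 1
                else:
                    wins[j] += 1
    return max(range(k), key=lambda i: wins[i])

# e.g. strategy 2 beats each of the others more often than not,
# so it should be selected with high probability
p = [[0.50, 0.45, 0.40],
     [0.55, 0.50, 0.45],
     [0.60, 0.55, 0.50]]
print(naive_dueling_bandit(p))
```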

Our motivating application is the problem of learning retrieval models for search services. Search users leave behind an abundance of implicit feedback, most notably the results they click on. However, simply counting clicks (i.e., using clicks as absolute feedback) leads to heavily biased results. A recent paper by Filip Radlinski, Madhu Kurup and Thorsten Joachims proposes a new way of comparing two different retrieval functions in an interleaving experiment that is minimally obtrusive to the end users and allows clicks to reliably indicate which retrieval function is superior (i.e., relative feedback). This development in turn motivated our work on the dueling bandits problem: the interleaving experiment can serve as an implementation of the comparison oracle required for obtaining relative feedback.
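As a rough illustration of the idea (a simplified sketch in the spirit of team-draft interleaving, not necessarily the exact procedure from their paper), the two retrieval functions take turns contributing results to a single interleaved list, and clicks are credited to whichever function contributed the clicked result:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Simplified team-draft-style interleaving: the two rankings take turns
    (in a randomized order) contributing their best not-yet-shown result."""
    rng = random.Random(seed)
    interleaved, teams, used = [], [], set()
    count_a = count_b = 0
    remaining = lambda ranking: [d for d in ranking if d not in used]
    while remaining(ranking_a) or remaining(ranking_b):
        # the ranking with fewer contributions goes next; ties broken at random
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        source, team = (ranking_a, 'A') if a_turn else (ranking_b, 'B')
        if not remaining(source):  # fall back if this ranking is exhausted
            source, team = (ranking_b, 'B') if a_turn else (ranking_a, 'A')
        pick = remaining(source)[0]
        interleaved.append(pick)
        teams.append(team)
        used.add(pick)
        if team == 'A':
            count_a += 1
        else:
            count_b += 1
    return interleaved, teams

def credit_clicks(teams, clicked_positions):
    """Credit each click to the ranking that contributed the clicked result;
    the ranking with more credited clicks wins this comparison."""
    wins = {'A': 0, 'B': 0}
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins

# e.g. rankings from two retrieval functions; suppose the user clicked positions 0 and 2
docs, teams = team_draft_interleave(['d1', 'd2', 'd3'], ['d3', 'd4', 'd1'], seed=1)
print(docs, teams, credit_clicks(teams, [0, 2]))
```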

There is still a lot of work left to do, including making the methods more broadly practical and developing more sophisticated and realistic formulations. It seems to me that a necessary (but not sufficient) condition for general AI is the ability to intelligently (and possibly interactively) explore an unknown environment. While the application setting I'm focusing on is painfully shallow compared to that lofty goal, I'm nonetheless pleased by this direction of research.
