Saturday, July 05, 2008

Machine Learning for Science

Those who know me know that I'm more or less constantly thinking about various aspects of science and technology. Lately, I've been thinking about a particularly remarkable trend that has emerged in recent years: using data analysis techniques (most notably machine learning) to develop scientific models that are very accurate but difficult to interpret. It really struck home for me when I read this article interviewing Vladimir Vapnik.

Professor Vapnik has been one of the leading visionaries in machine learning research, and remains very active to this day. In the aforementioned interview, he emphasized that science would be better served going forward if scientists were willing to give up some degree of explainability in exchange for models with superior predictive power. He quotes Einstein's remark that "when the number of factors coming into play is too large, scientific methods in most cases fail." This difference in philosophy closely mirrors an important cultural divide between statisticians and machine learning people. Since machine learning is rooted in computer science, we tend to focus on methods that can scale to very complex models and extremely large datasets. Statisticians, on the other hand, tend to prefer elegantly designed models that strive to explain the phenomena observed in the training data. As Vapnik points out,
Ten years ago, when statisticians did not buy our arguments, they did not do very well in solving high-dimensional problems. They introduced some heuristics, but this did not work well. Now they have adopted the ideas of statistical learning theory and this is an important achievement.

By coincidence, this month's issue of Wired Magazine leads with a complementary article titled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." The article discusses this very trend toward data-driven models, with Google as the prime example:
For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.
Researchers in computer architecture and systems design have typically relied on ad hoc heuristics to improve upon existing designs, and these approaches at best offer incremental improvements. Using machine learning, the recently developed self-optimizing memory controllers have completely blown the competition out of the water.
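
As I understand it, these controllers frame command scheduling as a reinforcement learning problem: the controller tries actions, observes how well the memory system performs, and gradually learns a scheduling policy. Here is a toy Python sketch of that idea. To be clear, the states, actions, reward, and numbers below are all invented for illustration and bear no resemblance to the real hardware design:

```python
import random
from collections import defaultdict

# Toy Q-learning scheduler: learn which memory command to issue in each
# (made-up) state so as to maximize long-term reward.
ACTIONS = ["precharge", "activate", "read", "write", "noop"]
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = defaultdict(float)  # Q[(state, action)] -> estimated long-term utility

def choose_action(state):
    """Epsilon-greedy selection over the learned Q-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def simulate_step(state, action):
    """Stand-in for the memory system: returns (reward, next_state).
    A real controller would observe queue occupancy, row-buffer hits, etc."""
    reward = 1.0 if action in ("read", "write") else 0.0
    next_state = (state + 1) % 8  # toy state transition
    return reward, next_state

state = 0
for cycle in range(10000):
    action = choose_action(state)
    reward, next_state = simulate_step(state, action)
    # Standard Q-learning update: move the estimate toward the observed
    # reward plus the discounted value of the best action in the next state.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state
```

The point isn't the particular update rule; it's that nobody hand-writes the scheduling policy. The controller discovers it from experience.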

At its core, machine learning is used to automatically design models (or theories) of observed phenomena. These models might not be very intelligible to humans, but their predictive power cannot be overlooked. Each self-optimizing memory controller learns its own adaptive model for optimally scheduling memory requests. Beyond its advertising algorithms, Google also automatically learns a model for ranking search results (i.e., a theory of how people search the internet).
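
To make that concrete, here is a minimal Python sketch of learning a predictive model directly from observations. The data and the simple linear model class are made up purely for illustration; real systems learn far richer models, but the spirit is the same:

```python
import numpy as np

# Synthetic "observations" of some phenomenon: 200 samples, 3 measured factors.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
true_w = np.array([2.0, -1.0, 0.5])
y = X.dot(true_w) + 0.1 * rng.randn(200)  # noisy outcomes

# Least-squares fit: the learned weights *are* the model. They predict well,
# yet offer no mechanistic explanation of the underlying process.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

predictions = X.dot(w)
print("mean squared error:", np.mean((predictions - y) ** 2))
```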

As I look into the future, I see machine learning and other data analysis techniques continuing to benefit from the growing wealth of data and computational power. I expect that we'll be able to accurately predict increasingly complex phenomena. As usual, I remain hopeful that we can effectively solve the problems related to aging, even if we don't know (at a deep level) why certain methods work. Who knows, maybe even the singularity is possible.
