Data Science for Science
One of the biggest themes at KDD 2015 was applying data science to support the sciences, which is something that's been on my mind a lot recently. Hugh Durrant-Whyte gave a great keynote on applying machine learning to discovery processes in geology and ecology. One thing that jumped out of his talk was how challenging it is to develop models that are interpretable to domain experts. This issue is ameliorated in his settings because he largely focused on spatial models, which are easier to visualize and interpret.
Susan Athey gave another keynote on the interplay between machine learning and causal inference in policy evaluation, which is an important issue for the sciences as well. I must admit, most of the talk went over my head, but there was some interesting debate after the talk about whether causality should be the goal or rather just more "robust" correlations (whatever that might mean).
I also really enjoyed the Data-Driven Science Panel, where the debate got quite heated at times. Two issues in particular stood out. First, what should be the role of machine learning and data mining experts in the ecosystem of data-driven science? On the one hand, computer scientists have historically had a large impact by developing systems and platforms that abstract away low-level complexity and empower the end user to be more productive. On the other hand, achieving such a solution in a data-rich world is a much messier (or at least different) type of endeavor. There are, of course, plenty of startups that address aspects of this problem, but a genuinely scalable solution for science remains elusive.
A second issue that was raised was whether computational researchers have made much of a direct impact on the sciences. The particular area raised by Tina Eliassi-Rad was the social sciences. Machine learning and data mining researchers have taken great interest in computational social science via studying large social networks. However, it is not clear to what extent computational researchers have directly made an impact on traditional social science fields. Of course, this issue is tied back to what the role of computational researchers should be. After all, many social scientists do use tools made by computational people, so the indirect impact is quite clear. Does it really matter that there hasn't been much direct impact?
Update on MOOCs
Daphne Koller gave a great keynote on the state of MOOCs and Coursera in particular. It seems that MOOCs nowadays are much smarter about their consumer base, and have diversified the way they deliver content and measure success for a wide range of students. For example, people now understand much better the different needs of college aspirants (who use MOOCs to supplement high school & college education) versus young professionals (who use MOOCs to get ahead in their careers) versus those seeking vocational skills (which is very popular in less developed countries).
One striking omission that was pointed out during the Q&A was that MOOCs have mostly abandoned the pre-college demographic, especially before high school. In retrospect, this is not too surprising, in large part due to the very different requirements for primary and secondary education across different states and school districts. But it does put a damper on the current MOOC enthusiasm, since many problems with education start much earlier than college.
Lessons Learned from Large-Scale A/B Testing
Ron Kohavi gave a keynote on lessons learned from online A/B testing. The most interesting aspect of his talk was just how well-tuned the existing systems are. One symptom of a highly tuned system is that it becomes very difficult to intuit whether a given modification will increase or decrease the performance of the system (or have no effect). For example, he posed a number of questions to the audience, such as: "Does increasing the description of the sponsored advertisements lead to increased overall clicks on ads?" Basically, the audience could not guess better than random. So the main lesson is to follow the data and not be too (emotionally) tied to your own intuitions when it comes to optimizing large, complex industrial systems.
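To make "follow the data" a bit more concrete, here is a minimal sketch of how one might check whether a change in click-through rate between two arms of an experiment is distinguishable from noise. This is a textbook two-proportion z-test, not anything shown in the talk; the function name and the example counts are made up for illustration, and real experimentation platforms handle many more subtleties.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for the difference in click rates.

    Hypothetical helper: arm A is the control, arm B is the treatment.
    """
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Pooled click rate under the null hypothesis of no difference.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under a standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative (made-up) counts: clicks and impressions per arm.
z, p = two_proportion_z_test(clicks_a=100, n_a=1000, clicks_b=200, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

The point of a test like this is exactly the one from the keynote: the decision comes from the measured counts, not from anyone's guess about which variant should win.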
Sports Analytics Workshop
I co-organized the 2nd workshop on Large-Scale Sports Analytics. I tried to get more eSports into the workshop this year, but alas fell a bit short. Thorsten did give an interesting talk that used eSports data, although the phenomenon he was studying was not specific to eSports. In many ways, eSports is an even better test bed for sports analytics than traditional sports because game replays track literally everything.
Within the more traditional sports regimes, it's clear that access to data remains a large bottleneck. Many professional leagues are hoarding their data like gold, but sadly do not have the expertise to leverage the data effectively. The situation actually seems better in Europe, where access to tracked soccer (sorry, futbol) games is relatively common. In the US, it seems like the data is only available to a select few sports analytics companies such as Second Spectrum. I'm hopeful that this situation will change in the near future as the various stakeholders become more comfortable with the idea that it's not the raw data that has value, but the processed artifacts built on top of that data.
Interesting Papers
There were plenty of interesting research papers at KDD, of which I'll just list a few that I particularly liked.
A Decision Tree Framework for Spatiotemporal Sequence Prediction
by Taehwan Kim, Yisong Yue, Sarah Taylor, and Iain Matthews
I'll start with a shameless piece of self-advertising. In collaboration with Disney Research, we trained a model to generate visual speech, i.e., animate the lower face in response to audio or phonetic inputs. See the demo video below:
More details here.
Inside Jokes: Identifying Humorous Cartoon Captions
by Dafna Shahaf, Eric Horvitz, and Robert Mankoff
Probably the most interesting application at KDD was on studying the anatomy of a joke. While the results may not seem too surprising in retrospect (e.g., the punchline should be at the end of the joke), what was really cool was that the model could quantify if one joke was funnier than another joke (i.e., rank jokes).
Cinema Data Mining: The Smell of Fear
by Jörg Wicker, Nicolas Krauter, Bettina Derstorff, Christof Stönner, Efstratios Bourtsoukidis, Thomas Klüpfel, Jonathan Williams, and Stefan Kramer
This was a cool paper that studied how the exhaled organic particles vary in response to different emotions. The authors instrumented a movie theater's air circulation system with chemical sensors, and found that the chemicals you exhale are indicative of various emotions such as fear or amusement. The presenter repeatedly lamented the fact that they didn't do this for any erotic films, and so they don't know what the cinematic chemical signature of arousal would look like.
Who supported Obama in 2012? Ecological inference through distribution regression
by Seth Flaxman, Yu-Xiang Wang, and Alex Smola
This paper presents a new solution to the ecological inference problem of inferring individual-level preferences from aggregate data. The primary data testbed was county-wise election outcomes together with demographic data that were reported at a different granularity or overlay. The main issue is how to estimate, e.g., female preference for one presidential candidate, using just these kinds of aggregate data.
Certifying and removing disparate impact
by Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian
Many people assume that because algorithms are "objective," they can't be biased or discriminatory. This assumption is invalid because the data or features themselves can be biased (cf. this interview with Cynthia Dwork). The authors of this paper propose a way to detect & remove bias in machine learning models that is tailored to the US legal definition of bias. The work is, of course, preliminary, but this paper was arguably the most thought-provoking of the entire conference.
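For a flavor of what a legal-style bias check can look like, here is a small sketch of the "four-fifths" (80%) rule of thumb from US employment guidelines: compare the rate of favorable outcomes for a protected group against a reference group and flag ratios below 0.8. This only illustrates the measurement; it is not the paper's certification or repair procedure, and the function name, group labels, and data are made up.

```python
def disparate_impact_ratio(outcomes, groups, protected, reference):
    """Ratio of favorable-outcome rates: protected group over reference group.

    outcomes: list of 0/1 decisions (1 = favorable).
    groups:   list of group labels, aligned with outcomes.
    Hypothetical helper for illustration only.
    """
    def favorable_rate(label):
        decisions = [o for o, g in zip(outcomes, groups) if g == label]
        if not decisions:
            raise ValueError(f"no samples for group {label!r}")
        return sum(decisions) / len(decisions)

    reference_rate = favorable_rate(reference)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(protected) / reference_rate


# Made-up decisions for two groups of four people each.
outcomes = [1, 0, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
ratio = disparate_impact_ratio(outcomes, groups, protected="a", reference="b")
flagged = ratio < 0.8  # four-fifths rule of thumb
print(ratio, flagged)
```

Note that a check like this looks only at outcomes, which is why a classifier can fail it even when the protected attribute is never used as an input feature.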
Edge-Weighted Personalized PageRank: Breaking A Decade-Old Performance Barrier
by Wenlei Xie, David Bindel, Alan Demers, and Johannes Gehrke
This paper proposes a reduction approach to personalized PageRank that yields a speedup of several orders of magnitude, thus allowing, for the first time, personalized PageRank to be computed at interactive speeds. This paper was also the recipient of the best paper award.
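For context on why speed matters here, below is a sketch of the standard power-iteration way of computing personalized PageRank on an edge-weighted graph. This is the kind of slow baseline that has to be rerun for every new personalization, not the reduction approach from the paper; the function name, the damping value of 0.85, and the tiny example graph are all assumptions for illustration.

```python
def personalized_pagerank(edges, restart, alpha=0.85, iters=100):
    """Power iteration for personalized PageRank (illustrative baseline).

    edges:   {source: {target: weight}} with non-negative weights.
    restart: {node: probability} teleport distribution, summing to 1.
    """
    nodes = set(edges) | set(restart)
    for targets in edges.values():
        nodes |= set(targets)

    rank = {n: restart.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        # Teleport step: a (1 - alpha) share of mass goes back to restart.
        nxt = {n: (1 - alpha) * restart.get(n, 0.0) for n in nodes}
        for u in nodes:
            targets = edges.get(u, {})
            total = sum(targets.values())
            if total == 0:
                # Dangling node: hand its mass back to the restart set.
                for n, prob in restart.items():
                    nxt[n] += alpha * rank[u] * prob
            else:
                # Walk step: split mass in proportion to edge weights.
                for v, w in targets.items():
                    nxt[v] += alpha * rank[u] * w / total
        rank = nxt
    return rank


# Tiny made-up graph, personalized entirely to node "a".
edges = {"a": {"b": 1.0}, "b": {"a": 1.0}, "c": {"a": 2.0}}
scores = personalized_pagerank(edges, restart={"a": 1.0})
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Every change to the edge weights or the restart distribution means repeating the whole loop over the graph, which is what makes interactive use impractical without something smarter.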