Monday, September 07, 2015

Thoughts on KDD 2015

Last month I attended KDD 2015 in beautiful Sydney, Australia. For those who don't know, KDD is the premier international conference for applied machine learning & data mining, and is often the venue for some of the most interesting data analysis research projects. Despite concerns that KDD 2015 would be a letdown after the great success of KDD 2014 in New York City, overall KDD 2015 was a fantastic conference, with an excellent lineup of invited speakers and plenty of interesting papers. Congratulations also to my PhD advisor Thorsten Joachims, who not only did a great job as PC Co-Chair, but was also the recipient of a Test of Time Award for his work on Optimizing Search Engines using Clickthrough Data.

Data Science for Science

One of the biggest themes at KDD 2015 was applying data science to support the sciences, which is something that's been on my mind a lot recently. Hugh Durrant-Whyte gave a great keynote on applying machine learning to discovery processes in geology and ecology. One thing that stood out from his talk was how challenging it is to develop models that are interpretable to domain experts. This issue is ameliorated in his settings because he largely focused on spatial models, which are easier to visualize and interpret.

Susan Athey gave another keynote on the interplay between machine learning and causal inference in policy evaluation, which is an important issue for the sciences as well. I must admit, most of the talk went over my head, but there was some interesting debate after the talk about whether causality should be the goal or rather just more "robust" correlations (whatever that might mean).

I also really enjoyed the Data-Driven Science Panel, where the debate got quite heated at times. Two issues in particular stood out. First, what should be the role of machine learning and data mining experts in the ecosystem of data-driven science? On the one hand, computer scientists have historically had a large impact by developing systems and platforms that abstract away low-level complexity and empower the end user to be more productive. However, achieving such a solution in a data-rich world is a much messier (or at least different) type of endeavor. There are, of course, plenty of startups that address aspects of this problem, but a genuinely scalable solution for science remains elusive.

A second issue that was raised was whether computational researchers have made much of a direct impact on the sciences. The particular area raised by Tina Eliassi-Rad was the social sciences. Machine learning and data mining have taken great interest in computational social science via studying large social networks. However, it is not clear to what extent computational researchers have made a direct impact on traditional social science fields. Of course, this issue ties back to what the role of computational researchers should be. On the one hand, many social scientists do use tools made by computational people, so the indirect impact is quite clear. Does it really matter that there hasn't been much direct impact?

Update on MOOCs

Daphne Koller gave a great keynote on the state of MOOCs and Coursera in particular. It seems that MOOCs nowadays are much smarter about their user base, and have diversified the way they deliver content and measure success for a wide range of students. For example, people now understand much better the different needs of college aspirants (who use MOOCs to supplement high school & college education) versus young professionals (who use MOOCs to get ahead in their careers) versus those seeking vocational skills (a use case that is very popular in less developed countries).

One striking omission that was pointed out during the Q&A was that MOOCs have mostly abandoned the pre-college demographic, especially before high school. In retrospect, this is not too surprising, in large part due to the very different requirements for primary and secondary education across different states and school districts. But it does put a damper on the current MOOC enthusiasm, since many problems with education start much earlier than college.

Lessons Learned from Large-Scale A/B Testing

Ron Kohavi gave a keynote on lessons learned from online A/B testing. The most interesting aspect of his talk was just how well-tuned the existing systems are. One symptom of a highly tuned system is that it becomes very difficult to intuit whether certain modifications will increase or decrease the performance of the system (or have no effect). For example, he posed a number of questions to the audience, such as: "Does lengthening the descriptions of sponsored advertisements lead to increased overall clicks on ads?" Basically, the audience could not guess better than random. So the main lesson is basically to follow the data and not be too (emotionally) tied to your own intuitions when it comes to optimizing large, complex industrial systems.

Sports Analytics Workshop

I co-organized the 2nd workshop on Large-Scale Sports Analytics. I tried to get more eSports into the workshop this year, but alas fell a bit short. Thorsten did give an interesting talk that used eSports data, although the phenomenon he was studying was not specific to eSports. In many ways, eSports is an even better test bed for sports analytics than traditional sports because game replays track literally everything.

Within the more traditional sports regimes, it's clear that access to data remains a large bottleneck. Many professional leagues are hoarding their data like gold, but sadly do not have the expertise to leverage the data effectively. The situation actually seems better in Europe, where access to tracked soccer (sorry, futbol) games is relatively common. In the US, it seems like the data is only available to a select few sports analytics companies such as Second Spectrum. I'm hopeful that this situation will change in the near future as the various stakeholders become more comfortable with the idea that it's not the raw data that has value, but the processed artifacts built on top of that data.

Interesting Papers

There were plenty of interesting research papers at KDD, of which I'll just list a few that I particularly liked.

A Decision Tree Framework for Spatiotemporal Sequence Prediction
by Taehwan Kim, Yisong Yue, Sarah Taylor, and Iain Matthews
I'll start with a shameless piece of self-advertising. In collaboration with Disney Research, we trained a model to generate visual speech, i.e., animate the lower face in response to audio or phonetic inputs. See the demo video below:

More details here.

Inside Jokes: Identifying Humorous Cartoon Captions
by Dafna Shahaf, Eric Horvitz, and Robert Mankoff
Probably the most interesting application at KDD was on studying the anatomy of a joke. While the results may not seem too surprising in retrospect (e.g., the punchline should be at the end of the joke), what was really cool was that the model could quantify whether one joke was funnier than another (i.e., rank jokes).

Cinema Data Mining: The Smell of Fear
by Jörg Wicker, Nicolas Krauter, Bettina Derstorff, Christof Stönner, Efstratios Bourtsoukidis, Thomas Klüpfel, Jonathan Williams, and Stefan Kramer
This was a cool paper that studied how exhaled organic compounds vary in response to different emotions. The authors instrumented a movie theater's air circulation system with chemical sensors, and found that the chemicals you exhale are indicative of various emotions such as fear or amusement. The authors repeatedly lamented the fact that they didn't do this for any erotic films, and so they don't know what the cinematic chemical signature of arousal would look like.

Who supported Obama in 2012? Ecological inference through distribution regression
by Seth Flaxman, Yu-Xiang Wang, and Alex Smola
This paper presents a new solution to the ecological inference problem of inferring individual-level preferences from aggregate data. The primary data testbed was county-wise election outcomes alongside demographic data reported at a different granularity or overlay. The main issue is how to estimate, e.g., female preference for one presidential candidate, using just these kinds of aggregate data.

Certifying and removing disparate impact
by Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian
Many people assume that, because algorithms are "objective," they can't be biased or discriminatory. This assumption is invalid because the data or features themselves can be biased (cf. this interview with Cynthia Dwork). The authors of this paper propose a way to detect & remove bias in machine learning models that is tailored to the US legal definition of bias. The work is, of course, preliminary, but this paper was arguably the most thought-provoking of the entire conference.

Edge-Weighted Personalized PageRank: Breaking A Decade-Old Performance Barrier
by Wenlei Xie, David Bindel, Alan Demers, and Johannes Gehrke
This paper proposes a reduction approach to personalized PageRank that yields a computational boost by several orders of magnitude, thus allowing, for the first time, personalized PageRank to be computed at interactive speeds. This paper was also the recipient of the best paper award.

Thursday, April 09, 2015

KDD 2015 Workshop on Large-Scale Sports Analytics

We are pleased to announce that the KDD Workshop on Large-Scale Sports Analytics will be taking place in Sydney this year on August 10th at KDD 2015. Similar to last year, it will be a full-day workshop consisting of invited speakers as well as poster sessions for submitted papers. The call for paper submissions is below.

=== Call for Submissions ===
When: August 10th, 2015
Where: Sydney, Australia
Website: http://large-scale-sports-analytics.org/

Description:

Virtually every aspect of sports analytics is now entering the “Big Data” phase, and the interest in effectively mining, modeling, and learning from such data has also been correspondingly growing. Relevant data sources include detailed play-by-play game logs, tracking data, physiological sensor data to monitor the health of players, social media and text-based content, and video recordings of games.

The objective of this workshop is to bring together researchers and analysts from academia and industry who work in sports analytics, data mining and machine learning. We hope to enable meaningful discussions about state-of-the-art in sports analytics research, and how it might be improved upon.

We seek poster submissions (which can be both preliminary research as well as recently published work) on topics including but not limited to:
* Spatiotemporal modeling
* Video, text and social media analysis
* Feature selection and dimensionality reduction
* Feature learning and latent factor models
* Computational rationality
* Real-time predictive modeling
* Interactive analysis & visualization tools
* Sensor technology and reliability
* Labeling and annotation of events/activities/tactics
* Real-time/deployed analytical systems
* Knowledge discovery of player/team/league behaviors
* Game Theory
* eSports


Submission Details:
Poster submissions should be extended abstracts no more than 4 pages in length (in KDD format; submissions do not need to be anonymized). Extended abstracts should be submitted by June 5th, 11:59 PM PDT. Details can be found at:

http://www.large-scale-sports-analytics.org/Large-Scale-Sports-Analytics/Submissions.html

Important Dates:
Submission - 5th June 2015 11:59 PM PDT
Notification - 30th June 2015
Workshop - 10th August 2015

Organizers:
Patrick Lucey (Disney Research) (patrick.lucey@disneyresearch.com)
Yisong Yue (Caltech) (yyue@caltech.edu)
Jenna Wiens (University of Michigan) (wiensj@umich.edu)
Stuart Morgan (Australian Institute of Sport) (stuart.morgan@ausport.gov.au)

Tuesday, January 13, 2015

A Brief Overview of Deep Learning

(This is a guest post by Ilya Sutskever on the intuition behind deep learning as well as some very useful practical advice. Many thanks to Ilya for such a heroic effort!)

Deep Learning is really popular these days. Big and small companies are getting into it and making money off it. It’s hot. There is some substance to the hype, too: large deep neural networks achieve the best results on speech recognition, visual object recognition, and several language related tasks, such as machine translation and language modeling.

But why? What’s so special about deep learning? (From now on, we shall use the term Large Deep Neural Networks --- LDNN --- which is what the vaguer term “Deep Learning” mostly refers to.) Why does it work now, and how does it differ from neural networks of old? Finally, suppose you want to train an LDNN. Rumor has it that it’s very difficult to do so, that it is “black magic” that requires years of experience. And while it is true that experience helps quite a bit, the amount of “trickery” is surprisingly limited ---- one needs to be on the lookout for only a small number of well-known pitfalls. Also, there are many open-source implementations of various state-of-the-art neural networks (cf. Caffe, cuda-convnet, Torch, Theano), which makes it much easier to learn all the details needed to make it work.
Why Does Deep Learning Work?
It is clear that, to solve hard problems, we must use powerful models. This statement is obvious. Indeed, if a model is not powerful, then there is absolutely no chance that it can succeed in solving a hard problem, no matter how good the learning algorithm is.

The other necessary condition for success is that our model is trainable. That too is obvious, for if we cannot train our model, then its power is useless --- it will never amount to anything, and great results will not be achieved. The model will forever remain in a state of unrealized potential.

Fortunately, LDNNs are both trainable and powerful.
Why Are LDNNs Powerful?
When I talk about LDNNs, I’m talking about 10-20 layer neural networks (because this is what can be trained with today’s algorithms). I can provide a few ways of looking at LDNNs that will illuminate the reason they can do as well as they do.

  • Conventional statistical models learn simple patterns or clusters. In contrast, LDNNs learn computation, albeit a massively parallel computation with a modest number of steps. Indeed, this is the key difference between LDNNs and other statistical models.

  • To elaborate further: it is well known that any algorithm can be implemented by an appropriate very deep circuit (with a layer for each timestep of the algorithm’s execution -- one example). What’s more, the deeper the circuit, the more expensive are the algorithms that can be implemented by the circuit (in terms of runtime). And given that neural networks are circuits as well, deeper neural networks can implement algorithms with more steps ---- which is why depth = more power.
    • N.B.: It is easy to see that a single neuron of a neural network can compute the conjunction of its inputs, or the disjunction of its inputs, simply by setting its connections to appropriate values (see the small sketch below).
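To make that concrete, here is a minimal sketch (plain NumPy, with weights chosen by hand rather than learned) of a single threshold neuron computing the conjunction or disjunction of its binary inputs:

```python
import numpy as np

def threshold_neuron(x, w, b):
    """A single neuron with a hard threshold activation."""
    return int(np.dot(w, x) + b > 0)

x = np.array([1, 0, 1])  # three binary inputs

# Conjunction (AND): fires only if all inputs are 1 (sum must exceed 2.5).
and_out = threshold_neuron(x, w=np.ones(3), b=-2.5)

# Disjunction (OR): fires if any input is 1 (sum must exceed 0.5).
or_out = threshold_neuron(x, w=np.ones(3), b=-0.5)

print(and_out, or_out)  # prints: 0 1
```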

  • Surprisingly, neural networks are actually more efficient than boolean circuits. By more efficient, I mean that a fairly shallow DNN can solve problems that require many more layers of boolean circuits. For a specific example, consider the highly surprising fact that a DNN with 2 hidden layers and a modest number of units can sort N N-bit numbers! I found the result shocking when I heard about it, so I implemented a small neural network and trained it to sort 10 6-bit numbers, which, to my surprise, was easy to do. It is impossible to sort N N-bit numbers with a boolean circuit that has two hidden layers and that is not gigantic.
    • The reason DNNs are more efficient than boolean circuits is because neurons perform a threshold operation, which cannot be done with a tiny boolean circuit.

  • Finally, human neurons are slow yet humans can perform lots of complicated tasks in a fraction of a second. More specifically, it is well-known that a human neuron fires no more than 100 times per second. This means that, if a human can solve a problem in 0.1 seconds, then our neurons have enough time to fire only 10 times --- definitely not much more than that. It therefore follows that a large neural network with 10 layers can do anything a human can in 0.1 seconds.

  • This is not a scientific fact, since it is conceivable that real neurons are much more powerful than artificial neurons; but real neurons may also turn out to be much less powerful than artificial neurons. In any event, the above is certainly a plausible hypothesis.

  • This is interesting because humans can solve many complicated perception problems in 0.1 seconds --- for example, humans can recognize the identity of an object that’s in front of them, recognize a face, recognize an emotion, and understand speech in a fraction of a second. In fact, if there exists even just one person in the entire world who has achieved an uncanny expertise in performing a highly complex task of some sort in a fraction of a second, then this is highly convincing evidence that a large DNN could solve the same task --- if only its connections are set to the appropriate values.

  • But won’t the neural network need to be huge? Maybe. But we definitely know that it won’t have to be exponentially large ---- simply because the brain isn’t exponentially large! And if human neurons turn out to be noisy (for example), which means that many human neurons are required to implement a single real-valued operation that can be done using just one artificial neuron, then the number of neurons required by our DNNs to match a human after 0.1 seconds is greatly diminished.

These four arguments suggest (strongly, in my opinion) that for a very wide variety of problems, there exists a setting of the connections of an LDNN that basically solves the problem. Crucially, the number of units required to solve these problems is far from exponential --- on the contrary, the number of units required is often so “small” that it is even possible, using current hardware, to train a network that achieves super-high performance on the task of interest. It is this last point which is so important, and requires additional elaboration:

  • We know that most machine learning algorithms are consistent: that is, they will solve the problem given enough data. But consistency generally requires an exponentially large amount of data. For example, the nearest neighbor algorithm can definitely solve any problem by memorizing the correct answer to every conceivable input. The same is true for a support vector machine --- we’d have a support vector for almost every possible training case for very hard problems. The same is also true for a neural network with a single hidden layer: if we have a neuron for every conceivable training case, so that neuron fires for that training case and not for any other, then we could also learn and represent every conceivable function from inputs to outputs. Everything can be done given exponential resources, but it is never ever going to be relevant in our limited physical universe.

  • And it is in this point that LDNNs differ from previous methods: we can be reasonably certain that a large but not huge LDNN will achieve good results on a surprising variety of problems that we may want to solve. If a problem can be solved by a human in a fraction of a second, then we have a very non-exponential super-pessimistic upper bound on the size of the smallest neural network that can achieve very good performance.

  • But I must admit that it is impossible to predict whether a given problem will be solvable by a deep neural network ahead of time, although it is often possible to tell whenever we know that a similar problem can be solved by an LDNN of a manageable size.

So that’s it, then. Given a problem, such as visual object recognition, all we need is to train a giant convolutional neural network with 50 layers. Clearly a giant convnet with 50 layers can be configured to achieve human-level performance on object recognition --- right? So we simply need to find these weights. Once we do, the problem is solved.
Learning.
What is learning? Learning is the problem of finding a setting of the neural network’s weights that achieves the best possible results on our training data. In other words, we want to “push” the information from the labelled data into the parameters so that the resulting neural network will solve our problem.

The success of Deep Learning hinges on a very fortunate fact: that well-tuned and carefully-initialized stochastic gradient descent (SGD) can train LDNNs on problems that occur in practice. It is not a trivial fact since the training error of a neural network as a function of its weights is highly non-convex. And when it comes to non-convex optimization, we were taught that all bets are off. Only convex is good, and non-convex is bad. And yet, somehow, SGD seems to be very good at training those large deep neural networks on the tasks that we care about. The problem of training neural networks is NP-hard, and in fact there exists a family of datasets such that the problem of finding the best neural network with three hidden units is NP-hard. And yet, SGD just solves it in practice. This is the main pillar of deep learning.
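For readers who haven't seen it spelled out, here is a minimal sketch of the SGD update being discussed, assuming w is a NumPy parameter vector and loss_gradient is a placeholder for your own gradient computation (real training loops add minibatching, learning rate schedules, and the other tricks discussed below):

```python
import numpy as np

def sgd(w, loss_gradient, data, lr=0.1, epochs=10):
    """Plain stochastic gradient descent: repeatedly take a small step
    against a noisy gradient estimated from one training case at a time."""
    for _ in range(epochs):
        np.random.shuffle(data)           # visit training cases in random order
        for example in data:
            w -= lr * loss_gradient(w, example)  # noisy gradient step
    return w
```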

We can say fairly confidently that successful LDNN training relies on the “easy” correlation in the data, which allows learning to bootstrap itself towards the more “complicated” correlations in the data. I have done an experiment that seems to support this claim: I found that training a neural network to solve the parity problem is hard. I was able to train the network to solve parity for 25 bits, 29 bits, but never for 31 bits (by the way, I am not claiming that learning parity is impossible for over 30 bits --- only that I didn’t succeed in doing so). Now, we know that parity is a highly unstable problem that doesn’t have any linear correlations: every linear function of the inputs is completely uncorrelated with the output, which is a problem for neural networks since they are mostly linear at initialization time (so perhaps I should’ve used larger initial weights? I will discuss the topic of weight initialization later in the text). So my hypothesis (which is shared by many other scientists) is that neural networks start their learning process by noticing the most “blatant” correlations between the input and the output, and once they notice them they introduce several hidden units to detect them, which enables the neural network to see more complicated correlations. Etc. The process goes on. I imagine some sort of a “spectrum” of correlations --- both easy and hard, and the network jumps from a correlation to a more complicated correlation, much like an opportunistic mountain climber.
Generalization.
While it is very difficult to say anything specific about the precise nature of the optimization of neural networks (except near a local minimum where everything becomes convex and uninteresting), we can say something nontrivial and specific about generalization.

And the thing we can say is the following: in his famous 1984 paper called "A Theory of the Learnable", Valiant proved, roughly speaking, that if you have a finite number of functions, say N, then every training error will be close to its corresponding test error once you have more than log N training cases (times a small constant factor). Clearly, if every training error is close to its test error, then overfitting is basically impossible (overfitting occurs when the gap between the training and the test error is large). (I am also told that this result was given in Vapnik’s book as a small exercise.) This theorem is easy to prove but I won’t do it here.

But this very simple result has a genuine implication for any implementation of neural networks. Suppose I have a neural network with N parameters. Each parameter will be a float32. So a neural network is specified with 32N bits, which means that we have no more than 2^(32N) distinct neural networks, and probably far fewer. This means that we won’t overfit much once we have more than 32N training cases. Which is nice. It means that it’s theoretically OK to count parameters. What’s more, if we are quite confident that each weight only requires 4 bits (say), and that everything else is just noise, then we can be fairly confident that the number of training cases will be a small constant factor of 4N rather than 32N.
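For readers who want the counting argument in symbols, here is a rough sketch using the standard finite-hypothesis-class bound (Hoeffding's inequality plus a union bound), not the precise statement of Valiant's theorem:

```latex
% The network is described by 32N bits, so the class H satisfies |H| <= 2^{32N}.
% With probability at least 1 - \delta, simultaneously for all f in H:
\[
  \bigl|\widehat{\mathrm{err}}(f) - \mathrm{err}(f)\bigr|
  \;\le\; \sqrt{\frac{\ln|H| + \ln(2/\delta)}{2m}}
  \;\le\; \sqrt{\frac{32N\ln 2 + \ln(2/\delta)}{2m}},
\]
% which becomes small once the number of training cases m is a modest
% constant multiple of 32N.
```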
The Conclusion:
If we want to solve a hard problem we probably need a LDNN, which has many parameters. So we need a large high-quality labelled training set to make sure that it has enough information to specify all the network’s connections. And once we get that training set, we should run SGD on it until the network solves the problem. And it probably will, if our neural network is large and deep.
What Changed Since the 80s?
In the old days, people believed that neural networks could “solve everything”. Why couldn’t they do it in the past? There are several reasons.

  • Computers were slow. So the neural networks of the past were tiny. And tiny neural networks cannot achieve very high performance on anything. In other words, small neural networks are not powerful.

  • Datasets were small. So even if it was somehow magically possible to train LDNNs, there were no large datasets that had enough information to constrain their numerous parameters. So failure was inevitable.

  • Nobody knew how to train deep nets. Deep networks are important. The current best object recognition networks have between 20 and 25 successive layers of convolutions. A 2-layer neural network cannot do anything good on object recognition. Yet back in the day everyone was very sure that deep nets could not be trained with SGD, since that would’ve been too good to be true!

It’s funny how science progresses, and how easy it is to train deep neural networks, especially in retrospect.
Practical Advice.
Ok. So you’re sold. You’re convinced that LDNNs are the present and the future and you want to train one. But rumor has it that it’s so hard, so difficult… or is it? The reality is that it used to be hard, but now the community has consolidated its knowledge and realized that training neural networks is easy as long as you keep the following in mind.

Here is a summary of the community’s knowledge of what’s important and what to look after:
  • Get the data: Make sure that you have a high-quality dataset of input-output examples that is large, representative, and has relatively clean labels. Learning is completely impossible without such a dataset.

  • Preprocessing: it is essential to center the data so that its mean is zero and so that the variance of each of its dimensions is one. Sometimes, when the input dimension varies by orders of magnitude, it is better to take the log(1 + x) of that dimension. Basically, it’s important to find a faithful encoding of the input with zero mean and sensibly bounded dimensions. Doing so makes learning work much better. This is the case because the weights are updated by the formula: the change in w_ij is proportional to x_i * dL/dy_j (where w denotes the weights from layer x to layer y, and L is the loss function). If the average value of the x’s is large (say, 100), then the weight updates will be very large and correlated, which makes learning bad and slow. Keeping things zero-mean and with small variance simply makes everything work much better.
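A minimal sketch of this preprocessing step, assuming X is a NumPy array with one training case per row (log_cols is a hypothetical argument listing the dimensions that span orders of magnitude):

```python
import numpy as np

def standardize(X, log_cols=()):
    """Center each input dimension to zero mean and scale it to unit variance.
    Columns listed in log_cols are first passed through log(1 + x)."""
    X = X.astype(np.float64).copy()
    for c in log_cols:
        X[:, c] = np.log1p(X[:, c])
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8          # avoid division by zero
    return (X - mean) / std, mean, std  # reuse mean/std on the test set
```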

  • Minibatches: Use minibatches. Modern computers cannot be efficient if you process one training case at a time. It is vastly more efficient to train the network on minibatches of 128 examples, because doing so will result in massively greater throughput. It would actually be nice to use minibatches of size 1, and they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed by the massive computational gains provided by minibatches. But don’t use very large minibatches because they tend to work less well and overfit more. So the practical recommendation is: use the smallest minibatch size that runs efficiently on your machine.
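A minimal sketch of minibatching, assuming X and y are NumPy arrays of inputs and labels:

```python
import numpy as np

def minibatches(X, y, batch_size=128, shuffle=True):
    """Yield (inputs, labels) minibatches; 128 is the size suggested above."""
    idx = np.arange(len(X))
    if shuffle:
        np.random.shuffle(idx)
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]
```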

  • Gradient normalization: Divide the gradient by minibatch size. This is a good idea because of the following pleasant property: you won’t need to change the learning rate (not too much, anyway), if you double the minibatch size (or halve it).

  • Learning rate schedule: Start with a normal-sized learning rate (LR) and reduce it towards the end.
    • A typical value of the LR is 0.1. Amazingly, 0.1 is a good value of the learning rate for a large number of neural network problems. Learning rates frequently tend to be smaller but rarely much larger.
    • Use a validation set ---- a subset of the training set on which we don’t train --- to decide when to lower the learning rate and when to stop training (e.g., when error on the validation set starts to increase).
    • A practical suggestion for a learning rate schedule: if you see that you stopped making progress on the validation set, divide the LR by 2 (or by 5), and keep going. Eventually, the LR will become very small, at which point you will stop your training. Doing so helps ensure that you won’t be (over-)fitting the training data to the detriment of validation performance, which happens easily and often. Also, lowering the LR is important, and the above recipe provides a useful way of controlling it via the validation set.
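Here is a sketch of that validation-based schedule; train_one_epoch and validation_error are placeholders for your own training and evaluation code, and the specific constants are just illustrative:

```python
def train_with_lr_schedule(model, train_one_epoch, validation_error,
                           lr=0.1, min_lr=1e-5, patience=2):
    """Halve the learning rate whenever the validation error stops improving,
    and stop training once the learning rate has become very small."""
    best_err, stalled = float("inf"), 0
    while lr > min_lr:
        train_one_epoch(model, lr)      # placeholder: one SGD pass over the data
        err = validation_error(model)   # placeholder: error on the held-out set
        if err < best_err:
            best_err, stalled = err, 0
        else:
            stalled += 1
            if stalled >= patience:
                lr /= 2.0               # or divide by 5
                stalled = 0
    return model
```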

  • But most importantly, worry about the Learning Rate. One useful idea used by some researchers (e.g., Alex Krizhevsky) is to monitor the ratio between the update norm and the weight norm. This ratio should be around 1e-3. If it is much smaller then learning will probably be too slow, and if it is much larger then learning will be unstable and will probably fail.
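A sketch of that monitoring step, for a single parameter matrix W and its SGD update:

```python
import numpy as np

def update_to_weight_ratio(W, gradient, lr):
    """Ratio of the norm of the SGD update to the norm of the weights.
    Aim for roughly 1e-3: much smaller suggests learning is too slow,
    much larger suggests learning may be unstable."""
    update = lr * gradient
    return np.linalg.norm(update) / (np.linalg.norm(W) + 1e-12)
```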

  • Weight initialization. Worry about the random initialization of the weights at the start of learning.
    • If you are lazy, it is usually enough to do something like 0.02 * randn(num_params). A value at this scale tends to work surprisingly well over many different problems. Of course, smaller (or larger) values are also worth trying.
    • If it doesn’t work well (say your neural network architecture is unusual and/or very deep), then you should initialize each weight matrix with init_scale / sqrt(layer_width) * randn. In this case init_scale should be set to 0.1 or 1, or something like that. (A small NumPy sketch of both recipes appears after this list item.)
    • Random initialization is super important for deep and recurrent nets. If you don’t get it right, then it’ll look like the network doesn’t learn anything at all. But we know that neural networks learn once the conditions are set.
    • Fun story: researchers believed, for many years, that SGD cannot train deep neural networks from random initializations. Every time they would try it, it wouldn’t work. Embarrassingly, they did not succeed because they used the “small random weights” for the initialization, which works great for shallow nets but simply doesn’t work for deep nets at all. When the nets are deep, the many weight matrices all multiply each other, so the effect of a suboptimal scale is amplified.
    • But if your net is shallow, you can afford to be less careful with the random initialization, since SGD will just find a way to fix it.
    You’re now informed. Worry and care about your initialization. Try many different kinds of initialization. This effort will pay off. If the net doesn’t work at all (i.e., never “gets off the ground”), keep applying pressure to the random initialization. It’s the right thing to do.
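The two initialization recipes above, as a minimal NumPy sketch (here the layer width is taken to be the fan-in of the weight matrix):

```python
import numpy as np

def init_simple(shape, scale=0.02):
    """The lazy option: small Gaussian weights at scale ~0.02."""
    return scale * np.random.randn(*shape)

def init_scaled(fan_in, fan_out, init_scale=1.0):
    """Scale by 1/sqrt(layer_width) so signals neither blow up nor vanish
    as they pass through many layers; try init_scale = 0.1 or 1."""
    return init_scale / np.sqrt(fan_in) * np.random.randn(fan_in, fan_out)
```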

  • If you are training RNNs or LSTMs, use a hard constraint over the norm of the gradient (remember that the gradient has been divided by batch size). Something like 15 or 5 works well in practice in my own experiments. Take your gradient, divide it by the size of the minibatch, and check if its norm exceeds 15 (or 5). If it does, then shrink it until it is 15 (or 5). This one little trick makes a huge difference in the training of RNNs and LSTMs, where otherwise the exploding gradient can cause learning to fail and force you to use a puny learning rate like 1e-6, which is too small to be useful.
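A sketch of that hard norm constraint, applied after dividing the gradient by the minibatch size as recommended earlier:

```python
import numpy as np

def clip_gradient(gradient, batch_size, max_norm=15.0):
    """Divide by the minibatch size, then rescale the gradient if its norm
    exceeds max_norm (15 or 5 are the values suggested above)."""
    g = gradient / batch_size
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g
```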

  • Numerical gradient checking: If you are not using Theano or Torch, you’ll probably be implementing your own gradients. It is easy to make a mistake when we implement a gradient, so it is absolutely critical to use numerical gradient checking. Doing so will give you complete peace of mind and confidence in your code. You will know that you can invest effort in tuning the hyperparameters (such as the learning rate and the initialization) and be sure that your efforts are channeled in the right direction.

  • If you are using LSTMs and you want to train them on problems with very long range dependencies, you should initialize the biases of the forget gates of the LSTMs to large values. By default, the forget gates are the sigmoids of their total input, and when the weights are small, the forget gate is set to 0.5, which is adequate for some but not all problems. This is the one non-obvious caveat about the initialization of the LSTM.
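A sketch of this initialization, assuming a hypothetical parameter dictionary in which "b_f" names the forget-gate bias vector of your own LSTM implementation (the exact parameter layout will differ between libraries):

```python
import numpy as np

def init_forget_bias(params, hidden_size, value=2.0):
    """Set the forget-gate bias to a large positive value so the forget gate
    starts close to 1 and long-range information is retained early in training.
    'b_f' is a hypothetical name; adapt it to your LSTM's parameter layout."""
    params["b_f"] = value * np.ones(hidden_size)
    return params
```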

  • Data augmentation: be creative, and find ways to algorithmically increase the number of training cases that are at your disposal. If you have images, then you should translate and rotate them; if you have speech, you should combine clean speech with all types of random noise; etc. Data augmentation is an art (unless you’re dealing with images). Use common sense.

  • Dropout. Dropout provides an easy way to improve performance. It’s trivial to implement and there’s little reason to not do it. Remember to tune the dropout probability, and do not forget to turn off dropout and to multiply the weights by 1 minus the dropout probability at test time. Also, be sure to train the network for longer. Unlike normal training, where the validation error often starts increasing after prolonged training, dropout nets keep getting better and better the longer you train them. So be patient.
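A sketch of that train/test asymmetry (classic, non-inverted dropout applied to a layer's activations, which is equivalent to scaling the outgoing weights by 1 minus the dropout probability at test time):

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, train=True):
    """Classic dropout: randomly zero units at training time; at test time
    keep all units but scale by (1 - p_drop)."""
    if train:
        mask = (np.random.rand(*activations.shape) >= p_drop)
        return activations * mask
    return activations * (1.0 - p_drop)
```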

  • Ensembling. Train 10 neural networks and average their predictions. It’s a fairly trivial technique that results in easy, sizeable performance improvements. One may be mystified as to why averaging helps so much, but there is a simple reason for the effectiveness of averaging. Suppose that two classifiers each have an accuracy of 70%. Then, when they agree, they are usually right. But when they disagree, one of them is often right, so now the average prediction will place much more weight on the correct answer. The effect will be especially strong whenever the network is confident when it’s right and unconfident when it’s wrong.
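A sketch of the averaging step, assuming each trained model exposes a hypothetical predict_proba(X) method returning class probabilities:

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the predicted class probabilities of several independently
    trained networks, then take the most probable class."""
    avg = np.mean([m.predict_proba(X) for m in models], axis=0)
    return np.argmax(avg, axis=1)
```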

I am pretty sure that I haven’t forgotten anything. The above 13 points cover literally everything that’s needed in order to train LDNNs successfully.
So, to Summarize:
  • LDNNs are powerful.
  • LDNNs are trainable if we have a very fast computer.
  • So if we have a very large high-quality dataset, we can find the best LDNN for the task.
  • Which will solve the problem, or at least come close to solving it.
The End.
But what does the future hold? Predicting the future is obviously hard, but in general, models that do even more computation will probably be very good. The Neural Turing Machine is a very important step in this direction. Other problems include unsupervised learning, which is completely mysterious and incomprehensible in my opinion as of 8 Jan 2015. Learning very complicated “things” from data without supervision would be nice. All these problems require extensive research.

Friday, December 19, 2014

Thoughts on NIPS 2014

NIPS 2014 happened last week, and what a great conference it was. Lots of great papers, workshops, and invited talks.
The NIPS Experiment
In contrast to previous years, the most talked about thing from NIPS this year was not any new machine learning approach, but rather a reviewing experiment called the NIPS Experiment.

In a nutshell, about 10% of submissions were reviewed independently by two sets of reviewers (including two different Area Chairs). The goal of the NIPS Experiment was to assess to what extent reviewers agreed on accept/reject decisions. The outcome of the experiment has been a challenge to interpret properly.

The provocative and thought-provoking blog post by Eric Price has garnered the most attention from the broader scientific community. Basically, one reasonable way of interpreting the NIPS Experiment results is that, of the papers accepted for publication at NIPS 2014, roughly half would have been rejected if they were reviewed again by a different set of reviewers. This, of course, highlights the degree of subjectivity and randomness (likely exacerbated by sub-optimal reviewing) inherent in reviewing for such a broad field as machine learning.

The most common way to analyze this is from a certain viewpoint about fairness. I.e., if we had a budget for K papers, did the top K submissions get published? From that standpoint, the answer seems to be a resounding no, no matter how you slice it. One can argue about the degree of unfairness, which is a much murkier subject.
Alternative Viewpoint via Regret Minimization
However, as echoed in a blog post by Bert Huang, NIPS was AWESOME this year. The poster sessions had lots of great papers, and the oral presentations were good.

So I'd like to offer a different viewpoint about NIPS, one based on regret minimization. Let's assume that the accepted papers that were more likely to be rejected in a second review are "borderline" papers (seems like a reasonable assumption, but perhaps there are arguments against it). Then, had we swapped out a bunch of borderline papers with other borderline papers that got rejected, would the quality of the conference have been that much better?

In other words, given a budget of K papers to accept, what is the collective quality of the K papers actually accepted versus the quality of the "optimal" set of K papers we should've accepted? It's conceivable that the regret in overall quality could be quite low even though the overlap between the two sets of papers is substantially different.

One might even argue, as alluded to here, that long-term regret minimization (i.e., reviewing for NIPS over many years) requires some amount of randomness and/or disagreement between reviewers. Otherwise, there could be a more serious risk of group-think or intellectual inbreeding that can cause the field to stagnate.

Not sure to what extent this viewpoint is appropriate. For instance, NIPS is also a venue by which junior researchers become established in the field. Having a significant amount of randomness in the reviewing process can definitely be detrimental to the morale and career prospects of junior researchers.
On to the Actual Papers
There were many great papers at NIPS this year. Here are a few that caught my eye:

Sequence to Sequence Learning with Neural Networks
by Ilya Sutskever, Oriol Vinyals & Quoc Le.
Ilya gave, hands down, the best talk at NIPS this year. Ever since it started becoming popular, Deep Learning has carried with it the idea that only Geoff Hinton & company could make deep networks work well. Ilya spent most of his talk describing how this is not the case anymore. He also showed how to use a recurrent architecture called Long Short-Term Memory (LSTM) to do sequence-to-sequence prediction with deep neural networks.

Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics
by Sergey Levine & Pieter Abbeel.
This paper combined reinforcement learning and neural networks in order to do policy search. What's shocking about this approach is how few training examples they needed to train a neural network. Granted, the neural network wasn't very deep, but still, the low amount of training data is quite surprising.

Learning to Optimize via Information-Directed Sampling
by Dan Russo & Benjamin Van Roy.
Dan Russo has been doing some great work recently on analyzing bandit/MDP algorithms and proposing new algorithms. This paper proposes the first (mostly) fundamentally new bandit algorithm design philosophy that I've seen in a while. It's not clear yet how to make this algorithm practical in a wide range of complex domains, but it's definitely exciting to think about.

Submodular meets Structured: Finding Diverse Subsets in Exponentially-Large Structured Item Sets
by Adarsh Prasad, Stefanie Jegelka & Dhruv Batra.
This paper deals with how to do submodular maximization when the ground set is exponentially large. The paper exploits specific structure in the ground set (e.g., structure that can be handled via cooperative cuts) in order to arrive at an efficient solution. It would be interesting to try to learn the diversity/submodular objective function rather than hand-craft a relatively simple one (from a modeling perspective).

From MAP to Marginals: Variational Inference in Bayesian Submodular Models
by Josip Djolonga & Andreas Krause.
Log submodular models are a new family of probabilistic models that generalizes things like associative Markov random fields. This paper shows how to perform variational marginal inference on log submodular functions, which might be wildly intractable when viewed through the lens of conventional graphical models (e.g., very large factors that obey a submodular structure). Very cool stuff.

Non-convex Robust PCA
by Praneeth Netrapalli, Niranjan U N, Sujay Sanghavi, Animashree Anandkumar & Prateek Jain.
This paper gives a very efficient and provably optimal approach for robust PCA, where a matrix is assumed to be low-rank except for a few sparse corruptions. This optimization problem is non-convex, and convex relaxations can often give sub-optimal results. They also have a cool demo.

How transferable are features in deep neural networks?
by Jason Yosinski, Jeff Clune, Yoshua Bengio & Hod Lipson.
Along with a scientific study on the transferability of neural network features, Jason Yosinski also developed a cool demo that can visualize the various hidden layers of a deep neural network.

Conditional Random Field Autoencoders for Unsupervised Structured Prediction
by Waleed Ammar, Chris Dyer & Noah A. Smith.
This paper gives a surprisingly efficient approach for learning unsupervised auto-encoders that avoids making overly restrictive independence assumptions. The approach is based on CRFs. I wonder if one can do this with a more expressive model class such as structured decision trees.

A* Sampling
by Chris J. Maddison, Daniel Tarlow & Tom Minka.
I admit that I don't really understand what's going on in this paper. But it seems like it's doing something quite new so there are perhaps many interesting connections to be made here. This paper also won one of the Outstanding Paper Awards at NIPS this year.

Monday, October 13, 2014

How to Write an Academic Research Statement

It's that time of year when junior researchers are preparing applications for academic positions. One of the largest uncertainties that many people have is how to properly write the research statement that is typically part of the application package. This post contains my thoughts on what a good Computer Science research statement should look like when applying to US and Canadian universities. Please keep in mind, though, that everyone's research profile is different, so what worked for me may not exactly work for you.

To be perfectly honest, the research statement is not the most important part of your application package -- the letters of recommendation are. Your letter writers are accomplished researchers in your field of study, and can place your work in context as well as compare you to other researchers (when they were at your current career stage). Whether or not a hiring committee seriously considers you for an onsite interview is largely a function of your recommendation letters and any other research reputation you've managed to acquire while disseminating your work (**).

Nonetheless, the research statement is still important, especially once the hiring committee gets down to a short list and is basically trying to figure out which of the strong candidates would be the most interesting and impactful additions to the department.

For reference, here's my research statement when I was on the job market in Fall 2012. I want to emphasize a few points:

(1) As my history teacher Dr. Skinner would always say: "Pithy and Erudition!" In other words, keep it short and to the point. You have to optimize for the case when someone with very limited time is doing a quick read of your research statement. No convoluted sentences, and no long paragraphs. As a general rule, I'd say it's probably too long if it's more than 3 pages. Optimize your wording to be as concise as possible.

(2) Tell a Story. Academics like to get excited by the potential of new research directions -- after all that's why many of us chose to pursue this line of work. So make sure you have an overarching vision in mind. For me, I chose to talk about machine learning with humans in the loop as my central theme. During my onsite interviews, I was repeatedly asked to describe what my NSF CAREER proposal would look like. The purpose of the question is so that the interviewer can get a sense of my research vision. I quickly realized that I can just re-emphasize various aspects of my research statement as my answer. This also helps create a consistent image of who you are as a researcher.

(3) Don't Regurgitate Your CV. Your letter writers will do a far better job of describing your previous accomplishments than you will in your research statement. Trust in them to do that. Only describe your previous work to support the story you're trying to tell. For me, I used my previous work to demonstrate that machine learning with humans in the loop is both a broadly practical and an intellectually deep research area. But I kept it to a bare minimum -- previous work took up just under 1 page in my research statement. Your research statement is your one chance in your application package to describe your vision to the hiring committee. Don't waste it all on dwelling in the past.

(4) It's OK to Stretch the Truth a Little Bit. Because you're trying to keep the research statement concise, you can't accurately describe all the details of your previous work. For instance, when I described my prior work, I did not include all the caveats that necessarily come with any such research result. That is OK; everyone understands that your research results have caveats. People not in your area don't want to read a laundry list of assumptions and conditions that your result must be couched in. And people who are interested will read your actual research papers. You can explicitly highlight the more interesting limitations of your previous work when you talk about future research directions.

(5) Don't Bullshit Too Much. Of course, you must be somewhat speculative when you're laying out your research vision and describing future research directions. But make sure that your speculations are grounded in some kind of sound reasoning. The easiest way to do this is to demonstrate that you've already done some preliminary work in the future directions you want to pursue. For my research statement, I listed one piece of preliminary work that I've done for each future direction. This is also a nice way to incorporate the more interesting peripheral parts of your CV into your research vision.

(6) Get Lots of Feedback and Iterate. I had many great mentors and colleagues who contributed significantly in helping to sharpen the wording and focus of my research statement.

Again, not all of these points may work for everyone, and I'm sure there are plenty of other good tips that I didn't mention (examples here and here). But hopefully this was useful to some. Best of luck everyone!

(**) This is not to say that your actual accomplishments are not important. If there is no substance to your work, then your letter writers won't write you strong letters, and your research won't have garnered you much recognition and reputation. Having done substantial work is assumed by default in this post.

Friday, October 03, 2014

Caltech CMS Faculty Opening 2014

The CMS department is growing! We have a tenure-track faculty opening -- see the official ad here. We are interested in outstanding candidates from all areas of applied math and computer science. We value high-impact and cross-cutting fundamental research more than the specific area or discipline. However, I personally would be delighted if we developed a stronger presence in computational linguistics and/or network science (with a healthy dose of machine learning sprinkled in, of course). Please apply!