Saturday, September 11, 2010

Latent Variable Models for Sentiment Classification

One of the major limitations to applying machine learning to train complex prediction models is the need for "training labels". For example, when classifying the sentiment of a document (e.g., a movie review, or a speech), we often base our classification on only a portion of the document. In particular, a movie review typically contains so-called "objective" components such as a summary of the plot of the movie, but such components do not provide information regarding the sentiment (e.g., positive or negative review).

It would be nice to be able to encode such structure into the prediction models we want to train using machine learning. In fact, it is known that properly incorporating such structure can improve the prediction accuracy of machine-learned models. However, this is where the limitation kicks in. To train such a model using conventional machine learning techniques, you'd need to acquire documents that are manually labeled with both the overall sentiment and the sentences of the document that explain the sentiment (e.g., the sentence, "I love this movie!"). And you'd likely have to do this separately for each domain (e.g., the distribution of words typically used to express sentiment in movie reviews is probably different from the words congressmen use to support or oppose a bill). This type of data is rather costly to obtain, and is often labeled inaccurately by the human labelers.

This is where latent variable models come to the rescue. Latent variable models essentially assume that some of the variables in your model are hidden and cannot be observed. For example, if I only have training data with document-level sentiment labels, then the hidden variables correspond to the sentences which best explain the sentiment of each document. As you might imagine, this is a much harder learning problem than one where all the training labels are known to start with.
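To make the idea of a hidden "best explaining sentences" variable a bit more concrete, here is a minimal sketch in Python. It is a toy illustration under my own assumptions, not the model from the paper: the bag-of-words scorer, the function names `sentence_score` and `predict`, the hand-picked weights, and the choice of `k` are all hypothetical. For each candidate label, the sketch picks the `k` sentences that most support that label (the hidden variable), and then predicts the label whose supporting sentences score highest.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, float]


def sentence_score(sentence: str, w: Weights) -> float:
    """Sum of per-word weights; positive totals lean toward positive sentiment."""
    return sum(w.get(tok, 0.0) for tok in sentence.lower().split())


def predict(sentences: List[str], w: Weights, k: int = 1) -> Tuple[int, List[int]]:
    """Return (label, indices of the k sentences that best support that label).

    The chosen sentence indices play the role of the hidden variable:
    they are never given as labels, only inferred from the weights.
    """
    best = None
    for y in (+1, -1):
        # Rank sentences by how strongly they support label y, keep the top k.
        top = sorted(
            ((y * sentence_score(s, w), i) for i, s in enumerate(sentences)),
            reverse=True,
        )[:k]
        total = sum(score for score, _ in top)
        support = sorted(i for _, i in top)
        if best is None or total > best[0]:
            best = (total, y, support)
    return best[1], best[2]


# Hypothetical hand-set weights, purely for illustration.
w = {"love": 2.0, "great": 1.5, "boring": -2.0, "awful": -2.5}
doc = [
    "The film follows a detective in Paris",  # "objective" plot summary
    "I love this movie",
    "The ending was great",
]
label, support = predict(doc, w, k=1)
print(label, support)
```

In this toy document, the plot-summary sentence contributes nothing to either label, so the prediction rests entirely on the sentence that actually expresses an opinion. The hard part, which this sketch skips, is learning the weights when only the document-level label is observed.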

Fortunately, there's been a growing body of work showing just how to train such latent variable models in a way so as to maximize final prediction performance. In a newly accepted paper with Ainur Yessenalina and Claire Cardie (to appear at EMNLP 2010), we've shown how to apply these methods to sentiment classification with really nice results.

Without sacrificing sentiment classification performance, our approach can actually learn to extract the best supporting sentences despite not knowing that information a priori. In fact, since we make good use of this structural assumption (that there is a subset of sentences that best explains the document-level sentiment), our model actually achieves better predictive performance than previous approaches, sometimes substantially.

We've also released the source code and data, so please check them out if you're interested in playing around with our approach.

I particularly like the example given in Table 4 in the paper, which is also shown below. This speech is from the US Congressional floor debate transcripts, and was made in support of the Stem Cell Research Enhancement Act (so the sentiment classification is positive, or "yea"). The best supporting sentences identified by our model are shown in bold face, and the least "subjective" sentences are underlined.

Mr. Speaker, I am proud to stand on the house floor today to speak in favor of the Stem Cell Research Enhancement Act, legislation which will bring hope to millions of people suffering from disease in this nation. I want to thank Congresswoman Degette and Congressman Castle for their tireless work in bringing this bill to the house floor for a vote.

The discovery of embryonic stem cells is a major scientific breakthrough. Embryonic stem cells have the potential to form any cell type in the human body. This could have profound implications for diseases such as Alzheimer’s, Parkinson’s, various forms of brain and spinal cord disorders, diabetes, and many types of cancer. According to the Coalition for the Advancement of Medical Research, there are at least 58 diseases which could potentially be cured through stem cell research.

That is why more than 200 major patient groups, scientists, and medical research groups and 80 Nobel Laureates support the Stem Cell Research Enhancement Act. They know that this legislation will give us a chance to find cures to diseases affecting 100 million Americans.

I want to make clear that I oppose reproductive cloning, as we all do. I have voted against it in the past. However, that is vastly different from stem cell research and as an ovarian cancer survivor, I am not going to stand in the way of science.

Permitting peer-reviewed Federal funds to be used for this research, combined with public oversight of these activities, is our best assurance that research will be of the highest quality and performed with the greatest dignity and moral responsibility. The policy President Bush announced in August 2001 has limited access to stem cell lines and has stalled scientific progress.

As a cancer survivor, I know the desperation these families feel as they wait for a cure. This congress must not stand in the way of that progress. We have an opportunity to change the lives of millions, and I hope we take it. I urge my colleagues to support this legislation.

The thing I find cool is that the least "subjective" (or equivalently, most "objective") sentences could plausibly have come from a speech made in opposition to a bill limiting stem cell research. That is to say, these sentences don't actually reveal much about the speaker's stance towards the specific bill in question.
