
Classifying tweets

Albert Bifet performs sentiment analysis – classifying tweets as +ve or –ve according to the feelings they express – on the Twitter data stream.
Twitter is a very nice example of a data stream, because its data is produced in real time. Twitter is a micro-blogging service that was built to discover what is happening at any moment in time. There are more than 300 million users and more than 2100 million search queries every day, and a very nice thing for us is that the data is public and can be accessed through a streaming API. In this lesson, we’re going to look at an application of sentiment analysis. Sentiment analysis is the task of classifying messages or tweets into two categories, positive or negative, depending on the feelings that we can see inside the messages. Often it’s very difficult to get labeled data.
In sentiment analysis with Twitter, there is a very basic approach that works very well. We can get labeled data from the tweets that have emoticons inside. Many tweets have positive or negative emoticons, and we can use this information to label them as positive or negative. We can use all of these tweets to train our model, and then apply it to the tweets that don’t have emoticons: we can predict the current polarity, the current sentiment around any specific product or company or topic.
An important thing that we need to look at when we are classifying tweets is whether the data is balanced or not. Let’s look at an example. In this simple confusion matrix, we see that we are predicting 82% of the instances as positive and 18% as negative. We also see that 75% of the instances are correctly classified as positive and 10% are correctly classified as negative. Our accuracy in this case is 85%. Is this good performance?
To answer this, one way is to look at a random classifier. Imagine a classifier that predicts randomly but follows the same distribution between the positive class and the negative class. This is the confusion matrix at the bottom. There we can see that this classifier is also predicting 82% of the instances as positive and 18% as negative. The interesting thing is that it predicts the positive class correctly for 68% of the instances and the negative class correctly for 3% of the instances.
That means that the accuracy here is 71%. So if our classifier is predicting with an accuracy higher than this, then we can say that it is a good classifier, but if its accuracy is below this 71%, then our classifier is not doing well. As you may know, the kappa statistic measures exactly this: the difference between the accuracy of our classifier and the accuracy of a random classifier that predicts using the same distribution of classes. Basically, the kappa statistic computes this difference and then divides by a normalizing factor, so that kappa is 1 for a perfect classifier and 0 for one that does no better than the random classifier. Now let’s look at an application.
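As a worked check of the figures quoted above, the short Python sketch below recomputes accuracy, the random classifier's accuracy, and kappa. The diagonal cells (75% and 10%) and the predicted totals (82% and 18%) come from the lesson; the off-diagonal cells (7% and 8%) are derived from them by subtraction. The function name `kappa_from_confusion` is only an illustrative choice.

```python
def kappa_from_confusion(tp, fp, fn, tn):
    """Return (accuracy, chance accuracy, kappa) for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    predicted_pos = (tp + fp) / total   # share of instances predicted positive
    actual_pos = (tp + fn) / total      # share of instances that are positive
    # Accuracy of a random classifier that keeps the same class distribution.
    chance = predicted_pos * actual_pos + (1 - predicted_pos) * (1 - actual_pos)
    kappa = (accuracy - chance) / (1 - chance)
    return accuracy, chance, kappa


# Percentages from the lesson: 82% predicted positive (75 correct, so 7 wrong),
# 18% predicted negative (10 correct, so 8 wrong).
accuracy, chance, kappa = kappa_from_confusion(tp=75, fp=7, fn=8, tn=10)
print(f"accuracy={accuracy:.2f} chance={chance:.2f} kappa={kappa:.2f}")
```

With these cells the chance accuracy comes out at about 71%, matching the lesson, and kappa is roughly 0.48: an accuracy of 85% is well short of what the raw number suggests once the class imbalance is taken into account.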
There is a Twitter sentiment corpus, made by students at Stanford, that contains tweets collected between April 2009 and June 2009. There are 800,000 tweets with positive emoticons and 800,000 tweets with negative emoticons. If we do a prequential evaluation using these tweets with a multinomial Naive Bayes classifier, a stochastic gradient descent classifier, and a Hoeffding Tree, what we see is that at the end of the stream, the stochastic gradient descent classifier gets an accuracy of 100%. This is not normal, so it is worth seeing why it is happening.
If you look at the kappa statistic, what we see is that at the moment the accuracy goes up to 100%, the kappa statistic goes down. That means that the data at that point starts to be completely unbalanced, belonging to only one class. In this data stream, if we compare the accuracy and kappa of the multinomial Naive Bayes, stochastic gradient descent, and Hoeffding Tree classifiers, what we can see is that stochastic gradient descent is better, but this may not apply to other data streams. What is very interesting is that in data stream mining we should look not only at the accuracy, but also at the resources: time and memory.
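The effect described above, where accuracy climbs to 100% while kappa falls, can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the stream is artificial (a balanced phase followed by a phase containing a single class), `CountLearner` is a toy stand-in rather than any of the classifiers used in the lesson, and kappa is set to 0 by convention when the chance agreement is exactly 1. The evaluation is prequential: each instance is first used for testing and then for training.

```python
import random
from collections import deque


def accuracy_and_kappa(pairs):
    """Accuracy and kappa for a list of (predicted, actual) binary labels."""
    n = len(pairs)
    acc = sum(p == a for p, a in pairs) / n
    pred_pos = sum(p for p, _ in pairs) / n
    act_pos = sum(a for _, a in pairs) / n
    chance = pred_pos * act_pos + (1 - pred_pos) * (1 - act_pos)
    # Assumed convention: kappa is 0 when chance agreement is already 1.
    kappa = 0.0 if chance == 1.0 else (acc - chance) / (1 - chance)
    return acc, kappa


class CountLearner:
    """Toy online learner: majority label seen so far for each feature value."""

    def __init__(self):
        self.counts = {}  # feature value -> [count of label 0, count of label 1]

    def predict(self, x):
        c0, c1 = self.counts.get(x, [0, 0])
        return 1 if c1 >= c0 else 0

    def learn(self, x, y):
        self.counts.setdefault(x, [0, 0])[y] += 1


def synthetic_stream(rng, n_balanced=2000, n_one_class=1000):
    """Balanced, noisy phase followed by a phase with only the positive class."""
    for _ in range(n_balanced):
        y = rng.randint(0, 1)
        x = y if rng.random() < 0.9 else 1 - y  # feature agrees with label 90%
        yield x, y
    for _ in range(n_one_class):
        yield 1, 1


def prequential(stream, learner, window=500):
    """Test-then-train evaluation, reporting metrics over a sliding window."""
    recent = deque(maxlen=window)
    history = []
    for i, (x, y) in enumerate(stream, start=1):
        recent.append((learner.predict(x), y))  # test on the instance first
        learner.learn(x, y)                     # then train on it
        if i % window == 0:
            history.append((i, *accuracy_and_kappa(list(recent))))
    return history


history = prequential(synthetic_stream(random.Random(0)), CountLearner())
for i, acc, kappa in history:
    print(f"after {i:5d} instances: accuracy={acc:.3f} kappa={kappa:.3f}")
```

In the final windows every instance is positive and every prediction is positive, so accuracy is 100% but the classifier is doing no better than chance, and kappa is reported as 0.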
In this lesson, we have seen an application of Twitter classification. Twitter is a micro-blogging streaming service that was built to discover what is happening at any moment in time and, more specifically, what is happening now. Data may be unbalanced in many data streams, so it’s always important to look not only at the accuracy, but also at other measures such as the kappa statistic.

Twitter is a vast, continuous, prolific, real-time data stream. Sentiment analysis is the task of classifying tweets as positive or negative according to the feelings they express. Emoticons constitute “ground truth” that can serve as training data. Data sets are unbalanced, with far more positives than negatives (which, when you think about it, is a nice comment about the world in general). This presents an evaluation problem that can be addressed using the “Kappa” statistic, which measures the difference between the accuracy of a particular classifier and that of a random one that uses only the class distribution.
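The use of emoticons as “ground truth” can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build any particular corpus: the emoticon lists, the function name `label_by_emoticon`, and the decision to skip tweets with mixed emoticons are all assumptions made for the example. The emoticons are stripped from the returned text so that a classifier trained on it cannot simply memorize them.

```python
# Hypothetical emoticon lists, chosen only for illustration.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def label_by_emoticon(tweet):
    """Return (label, text); label is None when the tweet cannot be labeled."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # No emoticon at all, or both kinds: leave the tweet unlabeled.
        return None, tweet
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, "")
    # Collapse the whitespace left behind by the removed emoticons.
    return ("positive" if has_pos else "negative"), " ".join(text.split())


for tweet in ["Great talk today :)", "Missed my train :(", "No emoticon here"]:
    print(label_by_emoticon(tweet))
```

Tweets that come back with a label would form the training set; the unlabeled ones are the tweets whose sentiment the trained model is then asked to predict.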

This article is from the free online

Advanced Data Mining with Weka
