This content is taken from the University of Waikato's online course, Advanced Data Mining with Weka.

## The University of Waikato

Skip to 0 minutes and 11 seconds Twitter is a very nice example of a data stream, because its data is produced in real time. Twitter is a micro-blogging service that was built to discover what is happening at any moment in time. There are more than 300 million users and more than 2,100 million search queries every day, and a very nice thing for us is that the data is public and can be accessed through a streaming API. In this lesson, we’re going to look at an application of sentiment analysis. Sentiment analysis is the task of classifying messages or tweets into two categories, positive or negative, depending on the feelings that we can see inside the messages. Many times it’s very difficult to get labeled data.

Skip to 0 minutes and 52 seconds In sentiment analysis with Twitter, there is a very basic approach that works very well. We can get labeled data using the tweets that have emoticons inside. Many tweets have positive or negative emoticons, and we can use this information to label them as positive or negative. We can use all of these tweets to train our model.

Skip to 1 minute and 19 seconds Then we can predict using the tweets that don’t have emoticons: we can predict the current polarity, the current sentiment, around any specific product, company or topic. An important thing to look at when we are classifying tweets is whether the data is balanced or not. Let’s look at an example. In this simple confusion matrix, we are predicting 82% of the instances as positive and 18% as negative. We classify the positive class correctly for 75% of the instances and the negative class correctly for 10% of the instances. Our accuracy in this case is 85%. Is this good performance?

Skip to 2 minutes and 10 seconds To answer this, we can look at a random classifier. Imagine a classifier that predicts randomly but follows the same distribution between the positive class and the negative class. This is the confusion matrix at the bottom. There we can see that this classifier also predicts 82% of the instances as positive and 18% as negative. The interesting thing is that it predicts the positive class correctly for 68% of the instances and the negative class correctly for 3% of the instances.
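The chance baseline can be reproduced from the figures quoted so far. The sketch below is an illustration, not part of the lecture: it assumes all percentages are fractions of the whole stream, so the two off-diagonal cells (7% and 8%) are inferred by subtraction, and the variable names are made up for this example.

```python
# Observed confusion matrix as fractions of all instances, keyed by
# (actual, predicted). The off-diagonal cells are inferred (assumption):
# they are what is left of the 82% / 18% prediction totals.
observed = {
    ("pos", "pos"): 0.75,  # positives classified correctly
    ("neg", "pos"): 0.07,  # 82% predicted positive minus 75% correct
    ("pos", "neg"): 0.08,  # 18% predicted negative minus 10% correct
    ("neg", "neg"): 0.10,  # negatives classified correctly
}
classes = ["pos", "neg"]

# Marginal distributions of the actual and the predicted labels.
actual = {c: sum(observed[(c, p)] for p in classes) for c in classes}
predicted = {c: sum(observed[(a, c)] for a in classes) for c in classes}

# A random classifier with the same prediction distribution is right on
# class c with probability actual[c] * predicted[c].
chance_correct = {c: actual[c] * predicted[c] for c in classes}
chance_accuracy = sum(chance_correct.values())

print(f"chance-correct positives: {chance_correct['pos']:.3f}")
print(f"chance-correct negatives: {chance_correct['neg']:.3f}")
print(f"chance accuracy:          {chance_accuracy:.3f}")
```

Rounded to whole percentages, the three printed values correspond to the 68%, 3% and 71% quoted for the random classifier.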

Skip to 2 minutes and 52 seconds That means that the accuracy here is 71%. If our classifier predicts with an accuracy higher than this, then we can say that it is a good classifier, but if its accuracy is below this 71%, then it is not doing well. As you may know, the kappa statistic measures this difference: the difference between the accuracy of our classifier and the accuracy of a random classifier that predicts using the same distribution of classes. Basically, the kappa statistic computes this difference and then applies a normalizing factor, so that kappa is 1 for a perfect classifier and 0 for one that does no better than the random classifier. Now let’s look at an application.
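As a minimal sketch of that normalization, assuming the usual form of the kappa statistic (observed accuracy minus chance accuracy, divided by one minus chance accuracy). The function name and the zero returned in the degenerate case are choices made for this example, not something stated in the lesson.

```python
def kappa(accuracy: float, chance_accuracy: float) -> float:
    """Improvement over chance, scaled by the largest possible improvement."""
    if chance_accuracy >= 1.0:
        # Degenerate case: the random classifier is already always right.
        # Returning 0.0 here is an assumed convention.
        return 0.0
    return (accuracy - chance_accuracy) / (1.0 - chance_accuracy)


# Figures from the example: 85% accuracy against a 71% chance baseline.
example = kappa(0.85, 0.71)
print(round(example, 2))

# A hypothetical classifier that is worse than the baseline.
worse = kappa(0.60, 0.71)
print(round(worse, 2))
```

With the example's figures, kappa comes out at roughly 0.48. A classifier that does worse than the random baseline, like the hypothetical second call, gets a negative value.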

Skip to 3 minutes and 39 seconds There is a Twitter sentiment corpus, made by students at Stanford, that contains tweets collected between April 2009 and June 2009. There are 800,000 tweets with positive emoticons and 800,000 tweets with negative emoticons. If we do a prequential evaluation using these tweets with a multinomial Naive Bayes classifier, a stochastic gradient descent classifier and a Hoeffding Tree, what we see is that at the end of the stream, the stochastic gradient descent classifier gets an accuracy of 100%. This is not normal, so it’s worth seeing why it’s happening.
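Prequential evaluation means that each instance is first used to test the model and then to train it. The sketch below illustrates that loop on a made-up six-tweet stream with a toy incremental multinomial Naive Bayes. The class, the function and the data are all hypothetical; this is not the Weka or MOA implementation used in the lesson.

```python
import math
from collections import Counter, defaultdict


class OnlineMultinomialNB:
    """Toy incremental multinomial Naive Bayes over whitespace tokens."""

    def __init__(self):
        self.class_counts = Counter()             # instances seen per class
        self.token_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def predict(self, text):
        if not self.class_counts:
            return None  # nothing learned yet
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            # Laplace smoothing, with one extra slot for unseen tokens.
            denom = sum(self.token_counts[label].values()) + len(self.vocab) + 1
            for tok in text.lower().split():
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def learn(self, text, label):
        self.class_counts[label] += 1
        for tok in text.lower().split():
            self.token_counts[label][tok] += 1
            self.vocab.add(tok)


def prequential_accuracy(stream, model):
    """Test-then-train: predict each instance before learning from it."""
    correct = seen = 0
    for text, label in stream:
        if model.predict(text) == label:  # test first
            correct += 1
        model.learn(text, label)          # then train
        seen += 1
    return correct / seen if seen else 0.0


# Invented stream, for illustration only.
stream = [
    ("love this phone", "pos"), ("hate this phone", "neg"),
    ("love the camera", "pos"), ("hate the battery", "neg"),
    ("love it", "pos"), ("hate it", "neg"),
]
model = OnlineMultinomialNB()
acc = prequential_accuracy(stream, model)
print(f"prequential accuracy: {acc:.2f}")
```

Because every prediction is made before the label is learned, no separate test set is needed, and the early mistakes made while the model knows little are counted in the final figure.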

Skip to 4 minutes and 31 seconds If we look at the kappa statistic, we see that at the moment the accuracy goes up to 100%, the kappa statistic goes down. That means that the data at that point starts to be completely unbalanced, belonging to only one class. In this data stream, if we compare the accuracy and kappa of the multinomial Naive Bayes, stochastic gradient descent and Hoeffding Tree classifiers, we can see that stochastic gradient descent is better, but this may not apply to other data streams. What is very interesting is that in data stream mining we should look not only at the accuracy, but also at the resources: time and memory.
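One way to see why accuracy and kappa diverge is to compute both over windows of (actual, predicted) pairs. This is a sketch with invented data; `window_metrics` is a hypothetical helper, and returning a kappa of 0 when chance agreement is already 100% is a convention assumed here.

```python
def window_metrics(pairs):
    """Return (accuracy, kappa) for a list of (actual, predicted) labels."""
    n = len(pairs)
    labels = {a for a, _ in pairs} | {p for _, p in pairs}
    accuracy = sum(a == p for a, p in pairs) / n
    # Chance agreement: product of the actual and predicted class
    # frequencies, summed over the classes.
    chance = sum(
        (sum(a == c for a, _ in pairs) / n) * (sum(p == c for _, p in pairs) / n)
        for c in labels
    )
    # Assumed convention: kappa is 0 when chance agreement is already total.
    kappa = 0.0 if chance == 1.0 else (accuracy - chance) / (1.0 - chance)
    return accuracy, kappa


# A balanced window with a few mistakes (invented data).
balanced = (
    [("pos", "pos")] * 4 + [("neg", "neg")] * 4
    + [("pos", "neg"), ("neg", "pos")]
)
# A window where the stream has collapsed to a single class.
one_class = [("pos", "pos")] * 10

print(window_metrics(balanced))
print(window_metrics(one_class))
```

In the one-class window the accuracy is perfect, yet kappa is 0, because always predicting the only remaining class is exactly what the random baseline would do.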

Skip to 5 minutes and 20 seconds In this lesson, we have seen an application of Twitter classification. Twitter is a micro-blogging streaming service that is built to discover what is happening at any moment in time and, more specifically, what is happening now. Data may be unbalanced in many data streams, so it’s always important to look not only at the accuracy, but also at other measures such as the kappa statistic.

# Classifying tweets

Twitter is a vast, continuous, prolific, real-time data stream. Sentiment analysis is the task of classifying tweets as positive or negative according to the feelings they express. Emoticons constitute “ground truth” that can serve as training data. Data sets are unbalanced, with far more positives than negatives (which, when you think about it, is a nice comment about the world in general). This presents an evaluation problem that can be addressed using the “Kappa” statistic, which measures the difference between the accuracy of a particular classifier and that of a random one that uses only the class distribution.
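The emoticon-based labelling described in this lesson can be sketched as follows. The emoticon lists and the function name are assumptions made for illustration, as is the choice to strip the emoticons from the text so that a classifier cannot simply learn the emoticon itself.

```python
# Assumed emoticon lists; the lesson does not say which ones were used.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")


def label_by_emoticon(tweet):
    """Return (label, text without emoticons), or None if there is no clear label."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all, or both kinds: skip the tweet.
        return None
    text = tweet
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, "")
    label = "positive" if has_pos else "negative"
    return label, " ".join(text.split())  # also collapses leftover spaces


# Invented tweets, for illustration only.
for tweet in ["great day :)", "so sad :-(", "no emoticon here", "mixed :) :("]:
    print(tweet, "->", label_by_emoticon(tweet))
```

Tweets that return `None` are the ones a trained model would later be asked to classify, since they carry no emoticon to act as a label.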