OK, so in this example exercise, we're going to be looking at LDA, latent Dirichlet allocation. Because we're working with LDA, we're going to be doing text analytics here: we're going to try to discover topics in documents. There's an enormous amount that you need to know to do text analytics well, and most of that is beyond this course. So rather than trying to cram it in, we'll just explain what's necessary to understand and follow the code. The comments in the actual code script go into a little more detail, but I will try to explain that as I go as well.
We start as usual by preparing the libraries, the packages, and the data set. This time, we'll be working with the New York Times data. This data set contains two columns.
It's a little bit difficult to see clearly. This data set contains two columns. Each row relates to an article in the New York Times. The first column is the title column, which gives the title of the article. The second column is a subject column, which gives a short subject description of the article. Now, we've sampled 500 articles from the full data set in the RTextTools package, so we've only got 500 rows, and the row labels are just the article numbers.
Now we're going to perform LDA on this data. We're going to try to find topic models where the text associated with each article is a combination of the title and the subject. The first thing we want to do is create a document-term matrix. We are going to use the create_matrix function from the RTextTools package.
There we've got it. Now we have our document-term matrix created. We're also going to specify how many topics we want to look for. We'll just look for 20 so that it runs quickly. And then we can perform LDA.
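The steps described so far can be sketched roughly as follows. This is a minimal sketch, not the contents of the LDA Ex1.R script: the seed, the variable names, the removeNumbers setting, and the empty-row guard are assumptions made for illustration, and it assumes RTextTools and topicmodels are installed.

```r
library(RTextTools)   # provides the NYTimes data and create_matrix()
library(topicmodels)  # provides LDA()

# Load the full New York Times data and sample 500 articles,
# keeping only the title and subject columns.
data(NYTimes)
set.seed(42)  # assumed seed, only so the sample is repeatable
articles <- NYTimes[sample(nrow(NYTimes), 500), c("Title", "Subject")]

# Build a document-term matrix in which each document is the
# article's title combined with its subject description.
dtm <- create_matrix(cbind(as.vector(articles$Title),
                           as.vector(articles$Subject)),
                     language = "english", removeNumbers = TRUE)

# LDA() rejects documents with no terms, so drop any empty rows (a guard).
dtm <- dtm[slam::row_sums(dtm) > 0, ]

# Fit LDA with 20 topics, as in the video.
k <- 20
lda <- LDA(dtm, k = k, control = list(seed = 42))
```

Because the sample and the fit are random, topic numbers and their contents from a run like this will generally differ from those shown in the video.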
This will take a little bit of time, but not too long.
Now, once the LDA function finishes running and we've created our model, we're going to look at the information contained within it. We're going to create a few graphs, so we will set them up here. The first thing we can do is look at the most likely topics per document.
Here, for example, we query the LDA model object to find the three most likely topics per document, and we're looking at the first 10 documents. So we see that for document one, the three most likely topics are topic 11, topic one, and topic 19. Now, this just gives us the index of the topic. It doesn't tell us what that topic is about. We have to examine the distribution of words for each topic to get some idea of how we should interpret the meaning or the content associated with that topic. But we know that, for example, for document one, the three most associated topics are topic 11, topic one, and topic 19.
But actually, that's not particularly useful, because what we really want to know is how probable each topic is for a particular document. So instead of looking at the three topics most associated with each document, why don't we actually look at the probabilities?
And so for example, we see the topic distribution of document one looks like this. The three most associated topics for document one were topic 11, topic one, and topic 19. But actually, it's overwhelmingly topic 11 that is associated with document one. That is to say, in the topic distribution of document one, the probability mass is almost entirely on topic 11. Let's see if we can find a document whose distribution is not concentrated on a single topic. Here we go. We see that for document four, topic 13 has a high probability, but topic 11 also has a reasonable probability.
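The contrast between the top-three listing and the full topic distribution can be sketched as below. This is an illustrative sketch rather than the course script: the setup lines repeat an assumed version of the model fit so the snippet runs on its own, and the seed and variable names are hypothetical.

```r
library(RTextTools)
library(topicmodels)

# Assumed setup: sample 500 articles, build the matrix, fit 20 topics.
data(NYTimes)
set.seed(42)
articles <- NYTimes[sample(nrow(NYTimes), 500), c("Title", "Subject")]
dtm <- create_matrix(cbind(as.vector(articles$Title),
                           as.vector(articles$Subject)),
                     language = "english", removeNumbers = TRUE)
dtm <- dtm[slam::row_sums(dtm) > 0, ]  # guard: LDA() needs non-empty documents
lda <- LDA(dtm, k = 20, control = list(seed = 42))

# Three most likely topics for each of the first 10 documents.
# Rows are ranks 1 to 3, columns are documents; entries are topic indices.
top3 <- topics(lda, 3)[, 1:10]
print(top3)

# Full topic distribution per document: one row per document,
# one column per topic, each row summing to 1.
doc_topics <- posterior(lda)$topics
print(round(doc_topics[1, ], 3))

# Plot the topic distribution of the first document.
barplot(doc_topics[1, ], names.arg = seq_len(ncol(doc_topics)),
        xlab = "Topic", ylab = "Probability",
        main = "Topic distribution, document 1")
```

The ranking in top3 says nothing about how far apart the probabilities are; the rows of doc_topics show whether one topic dominates a document or several share it.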
And so evidently, looking at the probability distribution, rather than just the three most probable topics or the n most probable topics, is much more informative. We can also view the most likely terms per topic.
Here, we can just do the n most likely terms per topic.
So for topic one, the n most likely terms are Israel, Mideast, talks, peace, Clinton, which already gives a fair indication of what this topic is about. Once again, just as it's less informative to look at the n most probable topics associated with a document, it's less informative to look at the n most likely terms when we can obtain more information by looking at the actual distribution of terms for a particular topic. And here, we have the term distribution for topic one.
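Querying the model for terms can be sketched in the same way. Again this is a hedged illustration rather than the course script: the setup lines repeat an assumed version of the model fit so the snippet runs on its own, and the choice of five terms, the seed, and the variable names are assumptions.

```r
library(RTextTools)
library(topicmodels)

# Assumed setup: sample 500 articles, build the matrix, fit 20 topics.
data(NYTimes)
set.seed(42)
articles <- NYTimes[sample(nrow(NYTimes), 500), c("Title", "Subject")]
dtm <- create_matrix(cbind(as.vector(articles$Title),
                           as.vector(articles$Subject)),
                     language = "english", removeNumbers = TRUE)
dtm <- dtm[slam::row_sums(dtm) > 0, ]  # guard: LDA() needs non-empty documents
lda <- LDA(dtm, k = 20, control = list(seed = 42))

# The n most likely terms per topic (here n = 5).
# Rows are ranks 1 to 5, columns are topics.
top_terms <- terms(lda, 5)
print(top_terms[, 1])

# Full term distribution per topic: one row per topic,
# one column per term in the vocabulary, each row summing to 1.
topic_terms <- posterior(lda)$terms

# The ten highest-probability terms of topic 1, with their probabilities.
print(head(sort(topic_terms[1, ], decreasing = TRUE), 10))
```

Topic indices are arbitrary labels, so topic 1 in a fresh run need not be the Middle East topic seen in the video.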
So we're able to query the LDA model to obtain the topic distribution per document and the term distribution per topic. If we want less information, we can just look at the n most probable topics per document, or the n most probable terms per topic. These topics are not immediately interpretable. We actually have to go in and look at the term distribution for each topic to get some idea of what these topics are about. But we do succeed in performing reasonably interesting topic modelling with LDA.
And we see that these topics do appear to have reasonably interpretable content, such as topic one: Israel, Mideast, talks, peace, Clinton; and topic 14: New York Mayor, NYC, school, et cetera.
LDA Exercise 1
A video exercise for latent Dirichlet allocation. The associated code is in the LDA Ex1.R file. Interested students are encouraged to replicate what we go through in the video themselves in R, but note that this is an optional activity intended for those who want practical experience in R and machine learning.
In this exercise we use LDA on a dataset containing the headlines and summaries of New York Times articles, in an attempt to group these articles by topic. We begin by creating a document-term matrix, using functionality from the RTextTools package. We then specify the number of topics we want to find and perform LDA using functionality from the topicmodels package. We spend some time discussing the steps taken and the results obtained, including looking at the most useful information that is available from the LDA model: (i) the probability distribution over topics given a document; and (ii) the probability distribution over words given a topic.
Note that the RTextTools and topicmodels R packages are used in this exercise. You will need to have them installed on your system. You can install packages using the install.packages function in R.
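One possible way to install the two packages only if they are not already present is sketched below. The variable names and the choice of CRAN mirror are assumptions for illustration.

```r
# Packages used in this exercise.
needed <- c("RTextTools", "topicmodels")

# Keep only those that cannot currently be loaded.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

# Install whatever is missing from CRAN (the mirror URL is an assumed choice).
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```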
Please note that the audio quality in parts of this video is lower than in other videos in this course.
© Dr Michael Ashcroft