Topic Modeling: Latent Dirichlet Allocation
Topic Modeling
Topic modeling is a form of unsupervised learning that seeks to categorize documents by topic. Typically the topics are considered to be latent variables and as such are to be discovered by the algorithm. We will look at one of the most common topic modeling algorithms: Latent Dirichlet Analysis (LDA).
LDA (Latent Dirichlet Analysis)
LDA categorizes documents by topic via a generative probabilistic model. The idea is that words are generated from topics, and each document has a particular probability of using particular topics to generate a given word. We seek to find which topics given documents are likely to use to generate words. To explain this model, we introduce the following definitions:
\(d\)  The number of documents
\(t\)  The number of topics
\(w\)  The number of unique Words
\(n_i\)  The number of words in the \(i^{th}\) document
\(D_i\)  The \(i^{th}\) document, \(1 \leq i \leq d\)
\(T_i\)  The \(i^{th}\) topic, \(1 \leq i \leq t\)
\(L_{ij}\)  The topic label of the \(j^{th}\) word in the \(i^{th}\) document, \(1 \leq j \leq n_i\), \(1 \leq i \leq d\)
\(W_{ij}\)  The \(j^{th}\) word in the \(i^{th}\) document, \(1 \leq j \leq n_i\), \(1 \leq i \leq d\)
The LDA model has six components:
 Document Prior: A Dirichlet distribution consisting of \(t\) parameters, all equal to a single real hyperparameter, \(\alpha\).
 Document Distributions: A categorical distribution for each document, whose \(t\) parameters give the probabilities that a particular topic will be used to generate words in that document.
 Topic Prior: A Dirichlet distribution consisting of \(w\) parameters, all equal to a single real hyperparameter, \(\beta\).
 Topic Distributions: A categorical distribution for each topic, whose \(w\) parameters give the probabilities that a particular word will be generated from the given topic.
 WordTopic Label: A nominal value for each word in each document specifying the topic ‘used’ by a given document to generate a particular word.
 Word: A nominal value representing the unique words of each document.
Of these, the words are known and the \(\alpha\) and \(\beta\) hyperparameters are specified, as is the number of topics. What is desired is to estimate the values of the parameters in the topic and document distributions. The first tells us what each topic is about, and the second tells us what topics are used in each document.
Typically, the \(\alpha\) and \(\beta\) hyperparameters are small (less than 1, but greater than 0). The effect of this is to make each topic give high probability values to a small subset of words, and each document give high probability values to a small number of topics. See the section on Dirichlet distributions for a deeper discussion of how this works.
Metropolis within Gibbs MCMC
Once again, the are a number of ways to attempt to estimate the most probable values of the parameters of interest. Particularly popular are Markov Chain Monte Carlo (MCMC) methods. MCMC methods are sampling techniques that generate new samples from a Markov Chain whose stationary distribution is the distribution of interest. Given a number of conditions, as the number of samples approaches infinity, the empirical distribution is guaranteed to approach the distribution of interest with probability of 1.
Concretely, we will examine the conceptually simplest of these: the Metropolis within Gibbs algorithm. This algorithm makes use of candidate functions that generate a new value for an unknown variable in the model, given the currently assigned value. These must be symmetric (though note that the MetropolisHastings variant of the algorithm works with nonsymmetric candidate functions), such that the probability of such a function generating value Y given the current value is X must be the same as the probability of the function generating X given the current value is Y. The candidate functions must also be such that a sequence of applications of them must permit movement from any given value to any other value of the variable within a finite number of steps.
Given such candidate functions, the basic Metropolis in Gibbs algorithm is:
 Assign all unknown variables in the model a random initial value
 Repeat until the desired number of samples are generated:
 For each unknown variable in the model:
 A. Calculate the probability of the model given currently assigned values. Let this value be \(P_{old}\).
 B. Use the candidate function associated with the variable to generate a new candidate value for that variable.
 C. Calculate the probability of the model given this variable is assigned the candidate value and all other variables remain unchanged. Let this value be \(P_{new}\).
 D. Assign the variable the candidate value with probability \(max ( 1 , \frac{P_{new}}{P_{old}} )\)
Since the initial state is assigned randomly, the first \(N\) samples are discarded. This is known as the \(burn\) and is a hyperparameter chosen by the datascientist. The remaining samples are then used to estimate the conditional probability distribution of the unknown variables in the model given the known variables. There are methods for checking that sufficient samples have probably been collected, the most common being to run multiple instances of the algorithm and continue sampling until all converge. There are also various ways to improve the algorithm beyond this basic form. Interested students are invited to read the literature.
We will look at what this means in practice in the LDA case with a simple example. Let us assume we have two documents:

Document 1: “hello world”

Document 2: “farewell cruel world”
Let us stipulate that we want two topics, \(T_1\) and \(T_2\), and that our hyperparameters are \(\alpha=.1\) and \(\beta=.2\).
Noting that we will be treating parameters of the probability distributions as variables, in addition to the document words, the known variables of the model are:

Document Prior: \(Dir \langle .1 , .1 \rangle\)

Topic Prior: \(Dir\langle .2 , .2 \rangle\)

Unique Words: $$ { hello , world , farewell, cruel }
The unknown variables are:

Document 1 Distribution: \(cat \langle p_{11} \rangle\)

Document 2 Distribution: \(cat \langle p_{21} \rangle\)

Topic 1 Distribution: \(cat \langle q_{11}, q_{12} ,q_{13} \rangle\)

Topic 2 Distribution: \(cat \langle q_{21}, q_{22} ,q_{23} \rangle\)

Word Topic Labels: \(\tau_{11} , \tau_{12} , \tau_{21} , \tau_{22} , \tau_{23}\)
Note that the distributions have one fewer variables than values, since they must sum to one.
To begin we assign all unknown variables random values:

Document 1 Distribution: \(cat \langle p_{11}=.9 \rangle\)

Document 2 Distribution: \(cat \langle p_{21}=.5 \rangle\)

Topic 1 Distribution: \(cat \langle q_{11}=.1, q_{12}=.2 ,q_{13}=.3 \rangle\)

Topic 2 Distribution: \(cat \langle q_{21}=.3, q_{22}=.3 ,q_{23}=.3 \rangle\)

Word Topic Labels: \(\tau_{11}=T_1 , \tau_{12}=T_1 , \tau_{21} =T_2, \tau_{22} =T_1, \tau_{23} =T_2\)
To generate the first sample, we must iterate through all unknown variables, generate a candidate value for them, compare probabilities of the model with the currently assigned value versus the candidate value, and possible assign the candidate value according to the acceptance equation in the algorithm. We look at only the first variable: \(p_{11}\). Since all the probability for all inknown variables except for \(\tau_{11}\) and \(\tau_{12}\) is conditionally independent of \(p_{11}\) given \(\tau_{11}\) and \(\tau_{12}\), we only need to examine the effects of the value of \(p_{11}\) on the probability of these two variables taking their assigned values.
First we calculate \(P_{old}\). The value currently assigned to \(p_{11}\) is \(.9\). Since it is drawn from the distribution \(Dir \langle .1 , .1 \rangle\) we have (rounding numbers to two decimal places):
\(p(p_{11}=.9)=0.44\) (Or, if you like, \(p(p_{11}=.9; \alpha=.1)=0.44\))
We also have:
\(p(\tau_{11}=T_1\)\(p_{11}=.9)=0.9\)
\(p(\tau_{12}=T_1\)\(p_{11}=.9)=0.9\)
Multiplying them we get: \(P_{old}=(.44)(.9)(.9)=.36\).
Let us assume our candidate function for \(p_{11}\) is \(p_{11}^{new}\)\(p_{11}^{old}=\theta)=unif(0,1)\). This means that the candidate function just samples a value from a uniform distribution between 0 and 1, ignoring the current value. We imagine that the candidate function offers the value \(.52\) as the candidate. We now use this to calculate \(P_{new}\) using the same process with the new value:
\[p(p_{11}=.52)=0.18\]\(p(\tau_{11}=T_1\)\(p_{11}=.52)=0.52\)
\(p(\tau_{12}=T_1\)\(p_{11}=.52)=0.52\)
Multiplying them we get: \(P_{old}=(.18)(.52)(.52)=.05\).
Accordingly, the probability of replacing the current assignment of \(.9\) to \(p_{11}\) with the candidate value \(.52\) is \(\frac{.05}{.36}=.14\).
We will stop there, but we would continue doing the same thing for all other unknown variables to get our first sample. Then we would repeat the whole thing many times to get all our samples. Eventually we would use these samples (discarding the burn) to estimate the parameters of the document and topic distributions and we would be done!
© Dr Michael Ashcroft