Skip main navigation

How to Use Metropolis within Gibbs MCMC for Latent Dirichlet Analysis

We look at the latent Dirichlet analysis topic modeling algorithm, and see how to train it using the Metropolis in Gibbs MCMC sampling algorithm.
© Dr Michael Ashcroft

Topic Modeling

Topic modeling is a form of unsupervised learning that seeks to categorize documents by topic. Typically the topics are considered to be latent variables and as such are to be discovered by the algorithm. We will look at one of the most common topic modeling algorithms: Latent Dirichlet Analysis (LDA).

LDA (Latent Dirichlet Analysis)

LDA categorizes documents by topic via a generative probabilistic model. The idea is that words are generated from topics, and each document has a particular probability of using particular topics to generate a given word. We seek to find which topics given documents are likely to use to generate words. To explain this model, we introduce the following definitions:
\(d\) – The number of documents
\(t\) – The number of topics
\(w\) – The number of unique Words
\(n_i\) – The number of words in the \(i^{th}\) document
\(D_i\) – The \(i^{th}\) document, \(1 \leq i \leq d\)
\(T_i\) – The \(i^{th}\) topic, \(1 \leq i \leq t\)
\(L_{ij}\) – The topic label of the \(j^{th}\) word in the \(i^{th}\) document, \(1 \leq j \leq n_i\), \(1 \leq i \leq d\)
\(W_{ij}\) – The \(j^{th}\) word in the \(i^{th}\) document, \(1 \leq j \leq n_i\), \(1 \leq i \leq d\)
The LDA model has six components:
  1. Document Prior: A Dirichlet distribution consisting of \(t\) parameters, all equal to a single real hyper-parameter, \(\alpha\).
  2. Document Distributions: A categorical distribution for each document, whose \(t\) parameters give the probabilities that a particular topic will be used to generate words in that document.
  3. Topic Prior: A Dirichlet distribution consisting of \(w\) parameters, all equal to a single real hyper-parameter, \(\beta\).
  4. Topic Distributions: A categorical distribution for each topic, whose \(w\) parameters give the probabilities that a particular word will be generated from the given topic.
  5. Word-Topic Label: A nominal value for each word in each document specifying the topic ‘used’ by a given document to generate a particular word.
  6. Word: A nominal value representing the unique words of each document.
Of these, the words are known and the \(\alpha\) and \(\beta\) hyper-parameters are specified, as is the number of topics. What is desired is to estimate the values of the parameters in the topic and document distributions. The first tells us what each topic is about, and the second tells us what topics are used in each document.
Typically, the \(\alpha\) and \(\beta\) hyper-parameters are small (less than 1, but greater than 0). The effect of this is to make each topic give high probability values to a small subset of words, and each document give high probability values to a small number of topics. See the section on Dirichlet distributions for a deeper discussion of how this works.

Metropolis within Gibbs MCMC

Once again, the are a number of ways to attempt to estimate the most probable values of the parameters of interest. Particularly popular are Markov Chain Monte Carlo (MCMC) methods. MCMC methods are sampling techniques that generate new samples from a Markov Chain whose stationary distribution is the distribution of interest. Given a number of conditions, as the number of samples approaches infinity, the empirical distribution is guaranteed to approach the distribution of interest with probability of 1.
Concretely, we will examine the conceptually simplest of these: the Metropolis within Gibbs algorithm. This algorithm makes use of candidate functions that generate a new value for an unknown variable in the model, given the currently assigned value. These must be symmetric (though note that the Metropolis-Hastings variant of the algorithm works with non-symmetric candidate functions), such that the probability of such a function generating value Y given the current value is X must be the same as the probability of the function generating X given the current value is Y. The candidate functions must also be such that a sequence of applications of them must permit movement from any given value to any other value of the variable within a finite number of steps.
Given such candidate functions, the basic Metropolis in Gibbs algorithm is:
  1. Assign all unknown variables in the model a random initial value
  2. Repeat until the desired number of samples are generated:
  • For each unknown variable in the model:
  • A. Calculate the probability of the model given currently assigned values. Let this value be \(P_{old}\).
  • B. Use the candidate function associated with the variable to generate a new candidate value for that variable.
  • C. Calculate the probability of the model given this variable is assigned the candidate value and all other variables remain unchanged. Let this value be \(P_{new}\).
  • D. Assign the variable the candidate value with probability \(max ( 1 , \frac{P_{new}}{P_{old}} )\)
Since the initial state is assigned randomly, the first \(N\) samples are discarded. This is known as the \(burn\) and is a hyper-parameter chosen by the data-scientist. The remaining samples are then used to estimate the conditional probability distribution of the unknown variables in the model given the known variables. There are methods for checking that sufficient samples have probably been collected, the most common being to run multiple instances of the algorithm and continue sampling until all converge. There are also various ways to improve the algorithm beyond this basic form. Interested students are invited to read the literature.
We will look at what this means in practice in the LDA case with a simple example. Let us assume we have two documents:
  • Document 1: “hello world”
  • Document 2: “farewell cruel world”
Let us stipulate that we want two topics, \(T_1\) and \(T_2\), and that our hyper-parameters are \(\alpha=.1\) and \(\beta=.2\).
Noting that we will be treating parameters of the probability distributions as variables, in addition to the document words, the known variables of the model are:
  • Document Prior: \(Dir \langle .1 , .1 \rangle\)
  • Topic Prior: \(Dir\langle .2 , .2 \rangle\)
  • Unique Words: $$ { hello , world , farewell, cruel }
The unknown variables are:
  • Document 1 Distribution: \(cat \langle p_{11} \rangle\)
  • Document 2 Distribution: \(cat \langle p_{21} \rangle\)
  • Topic 1 Distribution: \(cat \langle q_{11}, q_{12} ,q_{13} \rangle\)
  • Topic 2 Distribution: \(cat \langle q_{21}, q_{22} ,q_{23} \rangle\)
  • Word Topic Labels: \(\tau_{11} , \tau_{12} , \tau_{21} , \tau_{22} , \tau_{23}\)
Note that the distributions have one fewer variables than values, since they must sum to one.
To begin we assign all unknown variables random values:
  • Document 1 Distribution: \(cat \langle p_{11}=.9 \rangle\)
  • Document 2 Distribution: \(cat \langle p_{21}=.5 \rangle\)
  • Topic 1 Distribution: \(cat \langle q_{11}=.1, q_{12}=.2 ,q_{13}=.3 \rangle\)
  • Topic 2 Distribution: \(cat \langle q_{21}=.3, q_{22}=.3 ,q_{23}=.3 \rangle\)
  • Word Topic Labels: \(\tau_{11}=T_1 , \tau_{12}=T_1 , \tau_{21} =T_2, \tau_{22} =T_1, \tau_{23} =T_2\)
To generate the first sample, we must iterate through all unknown variables, generate a candidate value for them, compare probabilities of the model with the currently assigned value versus the candidate value, and possible assign the candidate value according to the acceptance equation in the algorithm. We look at only the first variable: \(p_{11}\). Since all the probability for all inknown variables except for \(\tau_{11}\) and \(\tau_{12}\) is conditionally independent of \(p_{11}\) given \(\tau_{11}\) and \(\tau_{12}\), we only need to examine the effects of the value of \(p_{11}\) on the probability of these two variables taking their assigned values.
First we calculate \(P_{old}\). The value currently assigned to \(p_{11}\) is \(.9\). Since it is drawn from the distribution \(Dir \langle .1 , .1 \rangle\) we have (rounding numbers to two decimal places):
\(p(p_{11}=.9)=0.44\) (Or, if you like, \(p(p_{11}=.9; \alpha=.1)=0.44\))
We also have:
Multiplying them we get: \(P_{old}=(.44)(.9)(.9)=.36\).
Let us assume our candidate function for \(p_{11}\) is \(p_{11}^{new}\)|\(p_{11}^{old}=\theta)=unif(0,1)\). This means that the candidate function just samples a value from a uniform distribution between 0 and 1, ignoring the current value. We imagine that the candidate function offers the value \(.52\) as the candidate. We now use this to calculate \(P_{new}\) using the same process with the new value:
Multiplying them we get: \(P_{old}=(.18)(.52)(.52)=.05\).
Accordingly, the probability of replacing the current assignment of \(.9\) to \(p_{11}\) with the candidate value \(.52\) is \(\frac{.05}{.36}=.14\).
We will stop there, but we would continue doing the same thing for all other unknown variables to get our first sample. Then we would repeat the whole thing many times to get all our samples. Eventually we would use these samples (discarding the burn) to estimate the parameters of the document and topic distributions and we would be done!
© Dr Michael Ashcroft
This article is from the free online

Advanced Machine Learning

Created by
FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education