How to Use Metropolis within Gibbs MCMC for Latent Dirichlet Allocation
Topic Modeling
Topic modeling is a form of unsupervised learning that seeks to categorize documents by topic. Typically the topics are treated as latent variables and, as such, are to be discovered by the algorithm. We will look at one of the most common topic modeling algorithms: Latent Dirichlet Allocation (LDA).
LDA (Latent Dirichlet Allocation)
LDA categorizes documents by topic via a generative probabilistic model. The idea is that words are generated from topics, and each document has a particular probability of using particular topics to generate a given word. We seek to find which topics given documents are likely to have used to generate their words. To explain this model, we introduce the following definitions:
- \(d\) – The number of documents
- \(t\) – The number of topics
- \(w\) – The number of unique words
- \(n_i\) – The number of words in the \(i^{th}\) document
- \(D_i\) – The \(i^{th}\) document, \(1 \leq i \leq d\)
- \(T_i\) – The \(i^{th}\) topic, \(1 \leq i \leq t\)
- \(L_{ij}\) – The topic label of the \(j^{th}\) word in the \(i^{th}\) document, \(1 \leq j \leq n_i\), \(1 \leq i \leq d\)
- \(W_{ij}\) – The \(j^{th}\) word in the \(i^{th}\) document, \(1 \leq j \leq n_i\), \(1 \leq i \leq d\)

The LDA model has six components (a code sketch of the corresponding generative process follows the list):
- Document Prior: A Dirichlet distribution consisting of \(t\) parameters, all equal to a single real hyper-parameter, \(\alpha\).
- Document Distributions: A categorical distribution for each document, whose \(t\) parameters give the probabilities that a particular topic will be used to generate words in that document.
- Topic Prior: A Dirichlet distribution consisting of \(w\) parameters, all equal to a single real hyper-parameter, \(\beta\).
- Topic Distributions: A categorical distribution for each topic, whose \(w\) parameters give the probabilities that a particular word will be generated from the given topic.
- Word-Topic Label: A nominal value for each word in each document specifying the topic ‘used’ by a given document to generate a particular word.
- Word: A nominal value representing the unique words of each document.
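To make the generative story concrete, the sketch below simulates it with NumPy. The corpus sizes, document lengths, and hyper-parameter values here are illustrative assumptions, not values fixed by the model, and the names `theta` and `phi` are simply labels chosen for the document and topic distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions for this sketch): d documents, t topics, w unique words.
d, t, w = 2, 2, 4
n = [2, 3]               # n_i: number of words in each document
alpha, beta = 0.1, 0.2   # hyper-parameters of the document and topic priors

# Document Prior -> Document Distributions: one categorical per document over the t topics.
theta = rng.dirichlet([alpha] * t, size=d)    # shape (d, t)

# Topic Prior -> Topic Distributions: one categorical per topic over the w unique words.
phi = rng.dirichlet([beta] * w, size=t)       # shape (t, w)

# Word-Topic Labels and Words: for each word position, draw a topic label from the
# document's distribution, then draw the word from that topic's distribution.
labels, words = [], []
for i in range(d):
    L_i = rng.choice(t, size=n[i], p=theta[i])       # L_ij
    W_i = [rng.choice(w, p=phi[k]) for k in L_i]     # W_ij
    labels.append(L_i.tolist())
    words.append(W_i)

print(labels)
print(words)
```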
Metropolis within Gibbs MCMC
Once again, there are a number of ways to attempt to estimate the most probable values of the parameters of interest. Particularly popular are Markov Chain Monte Carlo (MCMC) methods. MCMC methods are sampling techniques that generate new samples from a Markov chain whose stationary distribution is the distribution of interest. Given a number of conditions, as the number of samples approaches infinity, the empirical distribution is guaranteed to approach the distribution of interest with probability 1.

Concretely, we will examine the conceptually simplest of these: the Metropolis within Gibbs algorithm. This algorithm makes use of candidate functions that generate a new value for an unknown variable in the model, given the currently assigned value. These must be symmetric (though note that the Metropolis-Hastings variant of the algorithm works with non-symmetric candidate functions): the probability of a candidate function generating the value Y when the current value is X must equal the probability of it generating X when the current value is Y. The candidate functions must also allow movement from any given value of a variable to any other value within a finite number of steps.

Given such candidate functions, the basic Metropolis within Gibbs algorithm is (a code sketch follows the steps):
- Assign all unknown variables in the model a random initial value
- Repeat until the desired number of samples are generated:
- For each unknown variable in the model:
- A. Calculate the probability of the model given currently assigned values. Let this value be \(P_{old}\).
- B. Use the candidate function associated with the variable to generate a new candidate value for that variable.
- C. Calculate the probability of the model given this variable is assigned the candidate value and all other variables remain unchanged. Let this value be \(P_{new}\).
- D. Assign the variable the candidate value with probability \(\min ( 1 , \frac{P_{new}}{P_{old}} )\)
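The loop below is a minimal sketch of these steps in Python. It assumes the model's unknown variables live in a dict called `state`, that `model_prob(state)` returns the (possibly unnormalized) probability of the model under a full assignment, and that `candidates[name]` is a symmetric candidate function for each variable; all of these names are illustrative rather than part of any particular library.

```python
import random

def metropolis_within_gibbs(state, model_prob, candidates, num_samples):
    """Sketch of the steps above.

    state       -- dict of each unknown variable's (randomly initialised) current value
    model_prob  -- function giving the probability of the model for a full state
    candidates  -- dict of symmetric candidate functions, one per variable
    """
    samples = []
    for _ in range(num_samples):
        for name in state:                             # for each unknown variable
            p_old = model_prob(state)                  # A. probability with current values
            old_value = state[name]
            state[name] = candidates[name](old_value)  # B. generate a candidate value
            p_new = model_prob(state)                  # C. probability with the candidate value
            # D. keep the candidate with probability min(1, P_new / P_old)
            accept = p_old == 0 or random.random() < min(1.0, p_new / p_old)
            if not accept:
                state[name] = old_value                # otherwise restore the old value
        samples.append(dict(state))
    return samples
```

A candidate function for a word-topic label could, for instance, pick one of the other topics uniformly at random; this is symmetric, since the probability of proposing Y from X equals the probability of proposing X from Y.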
To make this concrete, consider a small example with two documents, two topics, and four unique words:
- Document 1: “hello world”
- Document 2: “farewell cruel world”
- Document Prior: \(Dir \langle .1 , .1 \rangle\)
- Topic Prior: \(Dir\langle .2 , .2 , .2 , .2 \rangle\)
- Unique Words: \(\{ hello , world , farewell , cruel \}\)
The unknown variables to be estimated are the document distributions, the topic distributions, and the word-topic labels:
- Document 1 Distribution: \(cat \langle p_{11} \rangle\)
- Document 2 Distribution: \(cat \langle p_{21} \rangle\)
- Topic 1 Distribution: \(cat \langle q_{11}, q_{12} ,q_{13} \rangle\)
- Topic 2 Distribution: \(cat \langle q_{21}, q_{22} ,q_{23} \rangle\)
- Word Topic Labels: \(\tau_{11} , \tau_{12} , \tau_{21} , \tau_{22} , \tau_{23}\)
For example, the sampler might currently assign these unknowns the following values (a sketch computing the model's probability under this assignment follows the list):
- Document 1 Distribution: \(cat \langle p_{11}=.9 \rangle\)
- Document 2 Distribution: \(cat \langle p_{21}=.5 \rangle\)
- Topic 1 Distribution: \(cat \langle q_{11}=.1, q_{12}=.2 ,q_{13}=.3 \rangle\)
- Topic 2 Distribution: \(cat \langle q_{21}=.3, q_{22}=.3 ,q_{23}=.3 \rangle\)
- Word Topic Labels: \(\tau_{11}=T_1 , \tau_{12}=T_1 , \tau_{21} =T_2, \tau_{22} =T_1, \tau_{23} =T_2\)
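As an illustration of step A for this example, the sketch below computes the probability of the model under this assignment, assuming the standard LDA joint density: the Dirichlet prior densities of the document and topic distributions multiplied, for every word, by the probability of its topic label under its document's distribution and of the word under that topic's distribution. It also assumes the unlisted final parameter of each categorical distribution is one minus the listed parameters (so, for example, Document 1 assigns probability \(.1\) to Topic 2).

```python
import numpy as np
from scipy.stats import dirichlet

alpha, beta = 0.1, 0.2

# Current assignment from the example; by assumption, each categorical's final
# parameter is one minus the listed parameters.
theta = np.array([[0.9, 0.1],                    # Document 1: cat<p11 = .9>
                  [0.5, 0.5]])                   # Document 2: cat<p21 = .5>
phi = np.array([[0.1, 0.2, 0.3, 0.4],            # Topic 1 over {hello, world, farewell, cruel}
                [0.3, 0.3, 0.3, 0.1]])           # Topic 2
labels = [[0, 0], [1, 0, 1]]                     # tau_11=T1, tau_12=T1, tau_21=T2, tau_22=T1, tau_23=T2
words = [[0, 1], [2, 3, 1]]                      # "hello world", "farewell cruel world"

# Dirichlet prior densities of the document and topic distributions.
p = 1.0
for row in theta:
    p *= dirichlet.pdf(row, [alpha] * len(row))
for row in phi:
    p *= dirichlet.pdf(row, [beta] * len(row))

# Likelihood: each topic label under its document's distribution,
# and each word under its label's topic distribution.
for i, (L_i, W_i) in enumerate(zip(labels, words)):
    for L_ij, W_ij in zip(L_i, W_i):
        p *= theta[i, L_ij] * phi[L_ij, W_ij]

print("P_old =", p)   # the P_old of step A for this assignment
```

Repeating the same computation after swapping in a candidate value for one variable gives \(P_{new}\), and the ratio of the two drives the accept/reject decision in step D.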