In the preceding chapters, we considered an approximate solution to the HMM parameter estimation problem using the expectation maximization algorithm. Now we turn to the Bayesian approach to parameter learning.

In Bayesian terms, our problem is to estimate the posterior distribution over parameters, given the data as well as the hyperparameters of the model.

\[\begin{align} \Pr(\boldsymbol{\theta}_T, \boldsymbol{\theta}_O \mid \mathbf{C}, \boldsymbol{\alpha}_T, \boldsymbol{\alpha}_O) &=& \frac{ \Pr(\boldsymbol{\theta}_O \mid \boldsymbol{\alpha}_O)\Pr(\boldsymbol{\theta}_T \mid \boldsymbol{\alpha}_T) \prod_{\vec{w} \in \mathbf{C} } \sum_{\vec{c}} \prod_{i=1}^{|\vec{w}|} \Pr(c^{(i)} | c^{(i-1)}, \vec{\theta}_{T,c^{(i-1)}}) \Pr(w^{(i)} | c^{(i)}, \vec{\theta}_{O,c^{(i)}}) } {\int_{\boldsymbol{\theta}_T} \int_{\boldsymbol{\theta}_O} \Pr(\boldsymbol{\theta}_O \mid \boldsymbol{\alpha}_O)\Pr(\boldsymbol{\theta}_T \mid \boldsymbol{\alpha}_T) \prod_{\vec{w} \in \mathbf{C} } \sum_{\vec{c}} \prod_{i=1}^{|\vec{w}|} \Pr(c^{(i)} | c^{(i-1)}, \vec{\theta}_{T,c^{(i-1)}}) \Pr(w^{(i)} | c^{(i)}, \vec{\theta}_{O,c^{(i)}}) \,d\vec{\theta}_O\,d\vec{\theta}_T } \\ \end{align}\]

As has become familiar at this stage, we have the problem that summing over all possible derivations for each observed sentence is intractable. We now also have the additional problem that the integrals over the parameters, which appear outside of these sums, cannot be solved in closed form. How can we approximate this complex posterior distribution?

One useful observation is to recall the two representations of probability distributions we have been using throughout the course: samplers and scorers. While implementing an exact scorer for this distribution is non-trivial, we can implement an approximate sampler.

What does the strange phrase approximate sampler mean? If we can draw i.i.d. samples directly from some distribution, this is called exact sampling. Often, however, we cannot take exact i.i.d. samples from our distribution of interest. Instead, we take samples that are either drawn from a different distribution or are not i.i.d., and use these in some way to approximate the true distribution.

The first such algorithm we will examine is called Gibbs sampling. The idea of Gibbs sampling is to sample a subset of the latent random variables in a model conditioned on fixed values for the others.
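Schematically, a Gibbs sampler is just a loop that repeatedly resamples each block of latent variables from its conditional distribution given the current values of the rest. Here is a minimal sketch in Python; the interface and names are hypothetical illustrations, not taken from the course code.

```python
# Generic Gibbs sampling skeleton (hypothetical interface, for illustration only).
# Each "block" of latent variables has a function that samples it from its
# conditional distribution given the current values of all the other blocks.

def gibbs_sweep(state, conditional_samplers):
    """Resample every block once, each conditioned on the current values of the rest."""
    for name, sample_given_rest in conditional_samplers.items():
        state[name] = sample_given_rest(state)
    return state

def gibbs(initial_state, conditional_samplers, num_sweeps):
    """Run repeated sweeps, recording the state visited after each one."""
    state, samples = dict(initial_state), []
    for _ in range(num_sweeps):
        state = gibbs_sweep(state, conditional_samplers)
        samples.append(dict(state))
    return samples
```

Note that the states visited by such a loop are not i.i.d. draws from the target distribution; successive states are correlated, which is exactly the sense in which Gibbs sampling is an approximate sampler.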

In the case of HMMs, Gibbs sampling leads to an algorithm which is similar to a sampling version of the Baum-Welch algorithm. We alternate between sampling from the conditional posterior distribution of the parameters given some fixed set of derivations,

\[\begin{align} \Pr(\boldsymbol{\theta}_T, \boldsymbol{\theta}_O \mid \mathbf{C}, \mathbf{D}, \boldsymbol{\alpha}_T, \boldsymbol{\alpha}_O) &=& \frac{ \Pr(\boldsymbol{\theta}_O \mid \boldsymbol{\alpha}_O)\Pr(\boldsymbol{\theta}_T \mid \boldsymbol{\alpha}_T) \prod_{\vec{w} \in \mathbf{C} } \prod_{i=1}^{|\vec{w}|} \Pr(c^{(i)} | c^{(i-1)}, \vec{\theta}_{T,c^{(i-1)}}) \Pr(w^{(i)} | c^{(i)}, \vec{\theta}_{O,c^{(i)}}) } {\int_{\boldsymbol{\theta}_T} \int_{\boldsymbol{\theta}_O} \Pr(\boldsymbol{\theta}_O \mid \boldsymbol{\alpha}_O)\Pr(\boldsymbol{\theta}_T \mid \boldsymbol{\alpha}_T) \prod_{\vec{w} \in \mathbf{C} } \prod_{i=1}^{|\vec{w}|} \Pr(c^{(i)} | c^{(i-1)}, \vec{\theta}_{T,c^{(i-1)}}) \Pr(w^{(i)} | c^{(i)}, \vec{\theta}_{O,c^{(i)}}) \,d\vec{\theta}_O\,d\vec{\theta}_T } \\ \end{align}\]

and sampling from the conditional posterior over derivations given some fixed set of parameter values:

\[\begin{align} \Pr(\mathbf{D} \mid \boldsymbol{\theta}_T, \boldsymbol{\theta}_O, \mathbf{C}, \boldsymbol{\alpha}_T, \boldsymbol{\alpha}_O) &=& \frac{ \Pr(\boldsymbol{\theta}_O \mid \boldsymbol{\alpha}_O)\Pr(\boldsymbol{\theta}_T \mid \boldsymbol{\alpha}_T) \prod_{\vec{w} \in \mathbf{C} } \prod_{i=1}^{|\vec{w}|} \Pr(c^{(i)} | c^{(i-1)}, \vec{\theta}_{T,c^{(i-1)}}) \Pr(w^{(i)} | c^{(i)}, \vec{\theta}_{O,c^{(i)}}) } { \Pr(\boldsymbol{\theta}_O \mid \boldsymbol{\alpha}_O)\Pr(\boldsymbol{\theta}_T \mid \boldsymbol{\alpha}_T) \prod_{\vec{w} \in \mathbf{C} } \sum_{\vec{c}} \prod_{i=1}^{|\vec{w}|} \Pr(c^{(i)} | c^{(i-1)}, \vec{\theta}_{T,c^{(i-1)}}) \Pr(w^{(i)} | c^{(i)}, \vec{\theta}_{O,c^{(i)}}) } \\ \end{align}\]

We have just studied the latter distribution in depth and know how to work with it. The former, of course, has a closed-form solution, as we saw in Inference using Conditionalization.
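To make the alternation concrete, here is a sketch of one Gibbs sweep for the HMM in Python with NumPy. Several representational choices here are assumptions for illustration and not from the text: words and categories are integer-coded, the start symbol is represented as an extra transition row, and $\boldsymbol{\alpha}_T$ and $\boldsymbol{\alpha}_O$ are pseudocount vectors of length $K$ and $V$. The parameter step exploits conjugacy and draws each row of $\boldsymbol{\theta}_T$ and $\boldsymbol{\theta}_O$ from its Dirichlet posterior; the derivation step uses one standard way of sampling a tag sequence exactly given the parameters, namely forward filtering followed by backward sampling.

```python
import numpy as np

def sample_parameters(derivations, corpus, K, V, alpha_T, alpha_O, rng):
    """Draw theta_T and theta_O from their Dirichlet posteriors given fixed derivations.

    theta_T[k] is the transition distribution out of category k (row K stands in
    for the start symbol); theta_O[k] is the emission distribution of category k.
    """
    counts_T = np.zeros((K + 1, K))
    counts_O = np.zeros((K, V))
    for tags, words in zip(derivations, corpus):
        prev = K                           # every sentence begins at the start symbol
        for c, w in zip(tags, words):
            counts_T[prev, c] += 1
            counts_O[c, w] += 1
            prev = c
    # Conjugacy: each row's posterior is Dirichlet(pseudocounts + observed counts).
    theta_T = np.array([rng.dirichlet(alpha_T + row) for row in counts_T])
    theta_O = np.array([rng.dirichlet(alpha_O + row) for row in counts_O])
    return theta_T, theta_O

def sample_derivation(words, theta_T, theta_O, K, rng):
    """Draw a tag sequence from Pr(c | w, theta) by forward filtering, backward sampling."""
    n = len(words)
    fwd = np.zeros((n, K))
    fwd[0] = theta_T[K] * theta_O[:, words[0]]             # transition out of the start symbol
    fwd[0] /= fwd[0].sum()                                  # normalize to avoid underflow
    for i in range(1, n):
        fwd[i] = (fwd[i - 1] @ theta_T[:K]) * theta_O[:, words[i]]
        fwd[i] /= fwd[i].sum()
    tags = np.empty(n, dtype=int)
    tags[n - 1] = rng.choice(K, p=fwd[n - 1])
    for i in range(n - 2, -1, -1):
        p = fwd[i] * theta_T[:K, tags[i + 1]]               # condition on the sampled next tag
        tags[i] = rng.choice(K, p=p / p.sum())
    return tags.tolist()

def hmm_gibbs_sweep(derivations, corpus, K, V, alpha_T, alpha_O, rng):
    """One sweep: parameters given derivations, then derivations given parameters."""
    theta_T, theta_O = sample_parameters(derivations, corpus, K, V, alpha_T, alpha_O, rng)
    new_derivations = [sample_derivation(w, theta_T, theta_O, K, rng) for w in corpus]
    return new_derivations, theta_T, theta_O
```

Normalizing the forward messages at each position keeps the recursion numerically stable without changing the sampled distribution, since only the relative weights at each position matter. Iterating this sweep from some initial guess at the derivations produces a sequence of samples of derivations and parameters.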

