The algorithm we examined for HMM parameter estimation, the Baum-Welch algorithm, is a special case of a very general and useful class of algorithms known as expectation maximization (EM) algorithms.

Expectation maximization algorithms are a general way to search for a maximum likelihood or maximum a posteriori point estimate of some parameters when there are hidden variables that must be marginalized away, making a direct solution intractable. In other words, EM is a method for approximately optimizing the objective function below, which contains a sum over some latent variables.

\[\DeclareMathOperator*{\argmax}{arg\,max} \underset {\theta }\argmax \log \Pr(D | \theta) = \underset {\theta }\argmax \log \sum_{H} \Pr(D, H \mid \theta)\]
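For concreteness, here is an illustrative example (separate from the HMM case above): in a two-component Gaussian mixture, the hidden variable \(h_n\) for each observed point \(x_n\) is the component that generated it, and marginalizing it away puts a sum inside the logarithm,

\[\log \Pr(D \mid \theta) = \sum_{n} \log \sum_{h_n \in \{1, 2\}} \pi_{h_n} \, \mathcal{N}(x_n \mid \mu_{h_n}, \sigma^2_{h_n}),\]

where \(\theta = (\pi_1, \pi_2, \mu_1, \mu_2, \sigma^2_1, \sigma^2_2)\). Because the logarithm of a sum does not decompose into a sum of logarithms, we cannot simply set derivatives to zero and solve for the parameters in closed form.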

EM can also be used to search for a point estimate of \(\theta\) that optimizes a Bayesian criterion \(\log \left[ \Pr(D \mid \theta)\Pr(\theta) \right] = \log \sum_{H} \Pr(D, H \mid \theta) \Pr(\theta)\). Note that this is not full Bayesian inference, since it searches for a single set of parameter values rather than computing a posterior distribution over the whole parameter space.

The basic idea behind EM is that instead of directly optimizing the intractable objective function above, we work with an alternative objective.

\[\DeclareMathOperator*{\argmax}{arg\,max} \underset {\theta }\argmax \sum_{H} \Pr(H \mid D, \theta^{\mathrm{old}}) \log \Pr(D, H \mid \theta)\]

This objective makes use of the posterior over our latent variables, \(\Pr(H \mid D, \theta^{\mathrm{old}})\), which must be computed before the parameters can be optimized. This leads to a two-step algorithm that alternates between computing this posterior under the old parameters \(\theta^{\mathrm{old}}\) (the expectation, or E, step) and then optimizing the parameters with respect to this posterior (the maximization, or M, step).
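As a minimal sketch of how this alternation plays out, the Python code below (an illustrative example, not part of the course materials) runs EM for the two-component Gaussian mixture described earlier, with the variances fixed to 1 for brevity; the model, variable names, and toy data are our own choices. The E-step computes the posterior responsibilities \(\Pr(h_n \mid x_n, \theta^{\mathrm{old}})\), and the M-step re-estimates the mixing weights and means from them.

```python
import numpy as np

def em_mixture(x, n_iter=30, seed=0):
    """EM for a two-component 1-D Gaussian mixture with unit variances.

    The hidden variable for each point is the component that generated it.
    """
    rng = np.random.default_rng(seed)
    pi = np.array([0.5, 0.5])                  # mixing weights
    mu = rng.choice(x, size=2, replace=False)  # initial component means

    for it in range(n_iter):
        # E-step: posterior over hidden assignments under the old parameters.
        # log Pr(x_n, h_n = k | theta_old) for every point n and component k
        log_joint = (np.log(pi)
                     - 0.5 * np.log(2.0 * np.pi)
                     - 0.5 * (x[:, None] - mu[None, :]) ** 2)
        log_evidence = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        resp = np.exp(log_joint - log_evidence[:, None])  # responsibilities

        # M-step: maximize the expected complete-data log likelihood.
        n_k = resp.sum(axis=0)
        pi = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k

        # Incomplete-data log likelihood; EM guarantees it never decreases.
        if it % 10 == 0 or it == n_iter - 1:
            print(f"iteration {it:2d}: log likelihood = {log_evidence.sum():.3f}")

    return pi, mu

# Toy data from two well-separated unit-variance components.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
print(em_mixture(data))
```

The printed log likelihoods illustrate the monotonicity property discussed below: each iteration leaves the incomplete-data log likelihood unchanged or larger, but where the run ends up still depends on the initial parameter values.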

Although the proof is beyond the scope of this course, it can be shown that this two-stage process is equivalent to optimizing a lower bound on the incomplete-data log likelihood. However, EM algorithms do not provide any general guarantee that the parameters they discover are globally optimal. They only guarantee that the likelihood of the data will not decrease (and will typically increase) on each iteration, and they can get stuck in local maxima of the search space.

