Over the last few chapters, we studied the problem of parsing for HMMs/FSAs—that is, the problem of inferring the hidden state/category sequence(s) underlying an observed sentence. We now turn to another, more complex problem of inductive inference: parameter estimation, or parameter learning.

Recall that an HMM is parameterized by two sets of probability distributions: transition distributions, which we wrote \(\vec{\theta}_{T,c}\), and observation distributions, which we wrote \(\vec{\theta}_{O,c}\). Let’s use \(\mathbf{\theta}_T\) to refer to the whole set of transition distributions—one for each state/category—and \(\mathbf{\theta}_O\) to refer to the whole set of observation/emission distributions, again one for each state/category.
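To make the notation concrete, here is one possible in-code representation of these parameter sets—a minimal sketch, assuming a nested-dict layout with a distinguished "START" symbol; the category names, vocabulary, and numbers are purely illustrative, not fixed by anything above:

```python
# One transition distribution per state/category (theta_T), and one
# observation/emission distribution per state/category (theta_O).
# Categories "N" and "V", the "START" symbol, and the tiny vocabulary
# are illustrative assumptions.
theta_T = {
    "START": {"N": 0.7, "V": 0.3},
    "N":     {"N": 0.4, "V": 0.6},
    "V":     {"N": 0.8, "V": 0.2},
}
theta_O = {
    "N": {"dogs": 0.6, "bark": 0.4},
    "V": {"dogs": 0.1, "bark": 0.9},
}
```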

The parameter estimation or learning problem for HMMs is to rank or choose these distributions given some observed corpus of sentences \(\mathbf{C}\).

Recall that in the maximum likelihood formulation of the problem, we wish to find those parameters which maximize the probability of the corpus. The marginal probability of a corpus \(\mathbf{C}\) given (the parameters of) an HMM is given by the following expression.

\[\begin{align} \Pr(\mathbf{C} \mid \mathbf{\theta}_T, \mathbf{\theta}_O) &= \prod_{\vec{w} \in \mathbf{C}} \sum_{\vec{c}} \Pr(\vec{w},\vec{c} \mid \mathbf{\theta}_T, \mathbf{\theta}_O) \\ \Pr(\mathbf{C} \mid \mathbf{\theta}_T, \mathbf{\theta}_O) &= \prod_{\vec{w} \in \mathbf{C}} \sum_{\vec{c}} \prod_{i=1}^{|\vec{w}|} \Pr(c^{(i)} \mid c^{(i-1)}, \vec{\theta}_{T,c^{(i-1)}}) \, \Pr(w^{(i)} \mid c^{(i)}, \vec{\theta}_{O,c^{(i)}}) \end{align}\]
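The inner sum ranges over every possible category sequence for each sentence. A minimal brute-force sketch of this computation, assuming the illustrative nested-dict representation and "START" symbol from above, makes the cost explicit—the enumeration is exponential in sentence length, whereas the forward/scoring computations from the earlier chapters obtain the same quantity in polynomial time:

```python
import itertools

def marginal_likelihood(corpus, theta_T, theta_O, categories):
    """Brute-force Pr(C | theta_T, theta_O): for each sentence, sum the
    joint probability Pr(w, c) over every hidden category sequence c,
    then multiply the per-sentence marginals together."""
    total = 1.0
    for w in corpus:
        sentence_prob = 0.0
        # Enumerates |categories| ** len(w) hidden sequences.
        for c in itertools.product(categories, repeat=len(w)):
            p = 1.0
            prev = "START"  # illustrative distinguished start state
            for word, cat in zip(w, c):
                p *= theta_T[prev][cat] * theta_O[cat][word]
                prev = cat
            sentence_prob += p
        total *= sentence_prob
    return total

# With the toy parameters above:
# marginal_likelihood([["dogs", "bark"]], theta_T, theta_O, ["N", "V"])
```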

In Maximum Likelihood, we discussed the maximum likelihood solution for a categorical model and showed how it was related to the counts of each outcome in the dataset. Here, however, we are summing over the sequences of categories. We don’t know the true counts of each categorical outcome, because the relevant information is hidden!
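To see concretely what is lost, here is a sketch of the count-based maximum likelihood solution in the fully supervised setting, where the category sequences *are* observed (the tagged-corpus format and "START" symbol are illustrative assumptions). With hidden categories, none of these counts can be read off the data:

```python
from collections import Counter, defaultdict

def supervised_mle(tagged_corpus):
    """If category sequences were observed, the MLE for each transition
    and emission distribution would just be normalized counts.
    tagged_corpus: iterable of sentences, each [(word, category), ...]."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        prev = "START"  # illustrative distinguished start state
        for word, cat in sentence:
            trans_counts[prev][cat] += 1
            emit_counts[cat][word] += 1
            prev = cat
    # Normalize each count table into a probability distribution.
    theta_T = {c: {c2: n / sum(cs.values()) for c2, n in cs.items()}
               for c, cs in trans_counts.items()}
    theta_O = {c: {w: n / sum(ws.values()) for w, n in ws.items()}
               for c, ws in emit_counts.items()}
    return theta_T, theta_O
```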

Similarly, in the Bayesian formulation of the problem we seek the posterior distribution over parameters given the corpus.

\[\begin{align} \Pr(\mathbf{\theta}_T, \mathbf{\theta}_O \mid \mathbf{C}) &= \frac{\Pr(\mathbf{C} \mid \mathbf{\theta}_T, \mathbf{\theta}_O) \Pr(\mathbf{\theta}_T) \Pr(\mathbf{\theta}_O)}{\int_{\mathbf{\theta}_T} \int_{\mathbf{\theta}_O} \Pr(\mathbf{C} \mid \mathbf{\theta}_T, \mathbf{\theta}_O) \Pr(\mathbf{\theta}_T) \Pr(\mathbf{\theta}_O) \,d\mathbf{\theta}_O \,d\mathbf{\theta}_T} \\ \Pr(\mathbf{\theta}_T, \mathbf{\theta}_O \mid \mathbf{C}) &= \frac{\Pr(\mathbf{\theta}_T) \Pr(\mathbf{\theta}_O) \prod_{\vec{w} \in \mathbf{C}} \sum_{\vec{c}} \Pr(\vec{w},\vec{c} \mid \mathbf{\theta}_T, \mathbf{\theta}_O)}{\int_{\mathbf{\theta}_T} \int_{\mathbf{\theta}_O} \left[ \prod_{\vec{w} \in \mathbf{C}} \sum_{\vec{c}} \Pr(\vec{w},\vec{c} \mid \mathbf{\theta}_T, \mathbf{\theta}_O) \right] \Pr(\mathbf{\theta}_T) \Pr(\mathbf{\theta}_O) \,d\mathbf{\theta}_O \,d\mathbf{\theta}_T} \end{align}\]

Note that once again we must evaluate integrals in the denominator whose integrand contains the sum over sequences of categories. It is not clear how to compute this directly!
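One naive way to appreciate the difficulty: the denominator is just the expectation of the likelihood under the prior, so in principle it can be approximated by ordinary Monte Carlo, drawing parameters from the prior and averaging. The sketch below does this assuming symmetric Dirichlet priors on every transition and observation distribution and reusing the `marginal_likelihood` sketch from above (the priors, `alpha`, and function names are all assumptions for illustration). In practice the variance of this estimator is far too high, and the inner sum is itself exponential—one motivation for the approximate-inference techniques ahead:

```python
import numpy as np

def evidence_mc(corpus, categories, vocab, alpha=1.0,
                num_samples=1000, seed=0):
    """Naive Monte Carlo estimate of the model evidence
    Pr(C) = E_{theta ~ prior}[ Pr(C | theta) ],
    assuming symmetric Dirichlet(alpha) priors over every transition
    and observation distribution (an illustrative modeling choice)."""
    rng = np.random.default_rng(seed)
    states = ["START"] + list(categories)
    samples = []
    for _ in range(num_samples):
        # Draw one full parameter set from the prior.
        theta_T = {s: dict(zip(categories,
                               rng.dirichlet(alpha * np.ones(len(categories)))))
                   for s in states}
        theta_O = {c: dict(zip(vocab,
                               rng.dirichlet(alpha * np.ones(len(vocab)))))
                   for c in categories}
        samples.append(marginal_likelihood(corpus, theta_T, theta_O, categories))
    return float(np.mean(samples))
```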

In the next several lectures, we will examine these problems in much greater detail and learn several new techniques for handling them via approximate inference.

