Our goal is to represent the posterior distribution over sequences of categories for an HMM/FSA.

\[\Pr(c^{(1)},\cdots,c^{(n)} \mid w^{(1)}, \cdots, w^{(n)})=\frac{\prod_{i=1}^n \Pr(w^{(i)} \mid c^{(i)}) \Pr(c^{(i)} \mid c^{(i-1)})}{\sum_{c^{\prime (1)},\cdots,c^{\prime (n)}}\prod_{i=1}^n \Pr(w^{(i)} \mid c^{\prime (i)}) \Pr(c^{\prime (i)} \mid c^{\prime (i-1)})}\]

Recall that we can represent a distribution either as a sampler or a scorer. In this unit, we will look at how we can implement efficient samplers and scorers for this posterior distribution: algorithms that get around the exponential blow-up we face if we try to represent this distribution in the naive way.

The denominator of the expression above is just the forward probability of the whole string. However, as before, this expression does not suggest an efficient way to represent this distribution. In particular, it suggests no way to sample a sequence other than to enumerate all sequences, score each one, and then renormalize using the forward probability. Since there are exponentially many sequences, this is not a useful way to sample from the posterior.
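To make the blow-up concrete, here is a minimal sketch of the naive sampler in Python. The dictionaries `trans` and `emit` and the distinguished `"START"` state are hypothetical names introduced for illustration, not part of the model definition above.

```python
import itertools
import random

def naive_posterior_sample(words, states, trans, emit):
    """Sample a state sequence by scoring all |states|^len(words) paths."""
    paths, scores = [], []
    for path in itertools.product(states, repeat=len(words)):
        score = 1.0
        prev = "START"
        for w, c in zip(words, path):
            # Pr(c | prev) * Pr(w | c) for each position in the sequence.
            score *= trans[prev][c] * emit[c][w]
            prev = c
        paths.append(path)
        scores.append(score)
    # The normalizer is the forward probability of the whole string.
    total = sum(scores)
    return random.choices(paths, weights=[s / total for s in scores])[0]
```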

Can we do better?

Let’s consider the posterior probability of an arbitrary state at time step \(k\). Let’s assume that we know that the state at time step \(k+1\) is equal to some particular state \(c^{(k+1)}\). We will consider the conditional probability over states at time step \(k\) given the state at time step \(k+1\), \(c^{(k+1)}\), and the string. In other words,

\[\Pr(c^{(k)} \mid c^{(k+1)}, w^{(1)}, \cdots, w^{(n)})\]

First, let’s consider the joint distribution,

\[\gamma_{k}(c^{(k)}, c^{(k+1)}) = \Pr(c^{(k)}, c^{(k+1)}, w^{(1)}, \cdots, w^{(n)})\]

This is equal to

\[\mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, c^{(k)})\Pr(c^{(k+1)}\mid c^{(k)})\Pr(w^{(k+1)}\mid c^{(k+1)})\mathbf{bk}(c^{(k+1)}, w^{(k+2)},\cdots,w^{(n)})\]
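As a sketch, this joint probability can be computed directly from the two trellises. The names below are hypothetical stand-ins: `fw_k[c]` for \(\mathbf{fw}(w^{(1)},\cdots,w^{(k)}, c)\), `bk_k1[c]` for \(\mathbf{bk}(c, w^{(k+2)},\cdots,w^{(n)})\), `w_k1` for the word \(w^{(k+1)}\), and `trans`/`emit` for the model's transition and emission tables.

```python
def gamma(fw_k, bk_k1, trans, emit, w_k1, c, c_next):
    """Joint probability Pr(c^(k) = c, c^(k+1) = c_next, w^(1), ..., w^(n))."""
    return fw_k[c] * trans[c][c_next] * emit[c_next][w_k1] * bk_k1[c_next]
```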

By the definition of conditional probability,

\[\Pr(c^{(k)} \mid c^{(k+1)}, w^{(1)}, \cdots, w^{(n)}) = \frac{\Pr(c^{(k)}, c^{(k+1)}, w^{(1)}, \cdots, w^{(n)})}{\sum_{\bar{c}^{(k)}} \Pr(\bar{c}^{(k)}, c^{(k+1)}, w^{(1)}, \cdots, w^{(n)})}\]

Substituting in the expression above, we get

\[\Pr(c^{(k)} \mid c^{(k+1)}, w^{(1)}, \cdots, w^{(n)}) = \frac{\mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, c^{(k)})\Pr(c^{(k+1)}\mid c^{(k)}) \Pr( w^{(k+1)} \mid c^{(k+1)}) \mathbf{bk}(c^{(k+1)}, w^{(k+2)},\cdots,w^{(n)})}{\sum_{\bar{c}^{(k)}} \mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, \bar{c}^{(k)})\Pr(c^{(k+1)}\mid \bar{c}^{(k)}) \Pr( w^{(k+1)} \mid c^{(k+1)}) \mathbf{bk}(c^{(k+1)}, w^{(k+2)},\cdots,w^{(n)})}\]

Finally, note that \(\Pr( w^{(k+1)} \mid c^{(k+1)})\) and \(\mathbf{bk}(c^{(k+1)}, w^{(k+2)},\cdots,w^{(n)})\) are constants in the above expression (they do not depend on \(c^{(k)}\)), so they cancel.

\[\Pr(c^{(k)} \mid c^{(k+1)}, w^{(1)}, \cdots, w^{(n)}) = \frac{\mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, c^{(k)})\Pr(c^{(k+1)}\mid c^{(k)}) }{\sum_{\bar{c}^{(k)}} \mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, \bar{c}^{(k)})\Pr(c^{(k+1)}\mid \bar{c}^{(k)})}\]
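In code, this conditional is just a renormalization of forward probabilities weighted by transition probabilities. A minimal sketch, using the same assumed table names as above:

```python
def backward_dist(fw_k, trans, c_next, states):
    """Pr(c^(k) | c^(k+1) = c_next, w^(1), ..., w^(n)) for each state c."""
    weights = {c: fw_k[c] * trans[c][c_next] for c in states}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```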

Note what we have found here. We can compute the exact conditional probability of each preceding state \(c^{(k)}\), given a specific following state \(c^{(k+1)}\), using only the forward probabilities and the transition probabilities. Thus, once we compute the trellis of forward probabilities, we can sample an entire path from the posterior: draw the final state in proportion to its forward probability, then sample each earlier state backward, one step at a time.

This algorithm is known as forward filtering, backward sampling.
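Putting the pieces together, here is a minimal sketch of forward filtering, backward sampling in Python, under the same assumptions as the earlier sketches (hypothetical `trans` and `emit` tables and a `"START"` state). Each sample costs time linear in the length of the string and quadratic in the number of states, rather than exponential.

```python
import random

def sample_from(weights):
    """Draw a key from a dict of (possibly unnormalized) weights."""
    cs, ws = zip(*weights.items())
    return random.choices(cs, weights=ws)[0]

def forward_filter_backward_sample(words, states, trans, emit):
    """Sample a path from the exact posterior over state sequences."""
    n = len(words)
    # Forward filtering: fw[k][c] = fw(w^(1), ..., w^(k+1), c), 0-indexed k.
    fw = [{c: trans["START"][c] * emit[c][words[0]] for c in states}]
    for k in range(1, n):
        fw.append({c: emit[c][words[k]] *
                      sum(fw[k - 1][b] * trans[b][c] for b in states)
                   for c in states})
    # Backward sampling: draw the final state in proportion to its forward
    # probability, then each earlier state given the state just sampled.
    path = [sample_from(fw[n - 1])]
    for k in range(n - 2, -1, -1):
        path.append(sample_from({c: fw[k][c] * trans[c][path[-1]]
                                 for c in states}))
    return list(reversed(path))
```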

