33 Exact Sampling and Scoring
;; the hidden categories (stop marks the end of a sequence)
(def categories '(N V Adj Adv P stop))
;; the observable words
(def vocabulary '(Call me Ishmael))
(defn logsumexp
  "Compute log2 of the sum of 2^v for each v in log-vals, stably, by
   factoring out the maximum value."
  [& log-vals]
  (let [mx (apply max log-vals)]
    (+ mx
       (Math/log2
        (apply +
               (map (fn [z] (Math/pow 2 z))
                    (map (fn [x] (- x mx))
                         log-vals)))))))
(defn flip
  [p]
  (< (rand 1) p))
(declare normalize) ;; normalize is defined just below
(defn sample-categorical
  "Sample an outcome from a categorical distribution whose probabilities
   params sum to one."
  [outcomes params]
  (if (flip (first params))
    (first outcomes)
    (sample-categorical (rest outcomes)
                        (normalize (rest params)))))
(defn score-categorical
  "Return the log2 probability of outcome under a categorical distribution."
  [outcome outcomes params]
  (if (empty? params)
    (throw "no matching outcome")
    (if (= outcome (first outcomes))
      (Math/log2 (first params))
      (score-categorical outcome (rest outcomes) (rest params)))))
(defn normalize
  [params]
  (let [sum (apply + params)]
    (map (fn [x] (/ x sum)) params)))
(defn sample-gamma
  "Sample from a Gamma distribution with integer shape, as the sum of
   shape exponential draws scaled by scale. Uses log base 2, so each draw
   is off by a constant factor of log2(e); this constant cancels when the
   samples are normalized, as in sample-dirichlet below."
  [shape scale]
  (* scale
     (apply +
            (repeatedly shape
                        (fn [] (- (Math/log2 (rand))))))))
(defn sample-dirichlet
  [pseudos]
  (let [gammas (map (fn [sh] (sample-gamma sh 1))
                    pseudos)]
    (normalize gammas)))
(defn update-context
  "Add new-symbol to the context, keeping at most (order - 1) symbols by
   dropping the oldest symbol once the context is full."
  [order old-context new-symbol]
  (if (>= (count old-context) order)
    (throw "Context too long!")
    (if (= (count old-context) (- order 1))
      (concat (rest old-context) (list new-symbol))
      (concat old-context (list new-symbol)))))
(defn hmm-unfold
  "Unfold an HMM into a sequence of [state observation] pairs, ending with
   the state for which stop? returns true."
  [transition observation order context current stop?]
  (if (stop? current)
    (list current)
    (let [new-context (update-context order context current)
          nxt (transition new-context)]
      (cons [current (observation current)]
            (hmm-unfold transition
                        observation
                        order
                        new-context
                        nxt
                        stop?)))))
(defn all-but-last
  "Return l without its final element."
  [l]
  (cond (empty? l) (throw "all-but-last called on an empty list")
        (empty? (rest l)) '()
        :else (cons (first l) (all-but-last (rest l)))))
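Before turning to inference, it helps to see these helpers in action. Below is a minimal usage sketch, not part of the model definitions above: it draws random transition and observation distributions from uniform Dirichlet priors and unfolds a single sequence from the resulting bigram (order 2) HMM. The start symbol, the maker functions, and the uniform pseudo-counts are all assumptions introduced here for illustration.
;; Illustrative sketch only: sample a random order-2 HMM and unfold one
;; sequence. The start symbol and the uniform Dirichlet pseudo-counts are
;; assumptions made for this example.
(defn make-transition
  "Sample one categorical distribution over categories per conditioning
   category and return a transition function over contexts."
  []
  (let [dists (into {}
                    (map (fn [c]
                           [c (sample-dirichlet (repeat (count categories) 1))])
                         (cons 'start categories)))]
    ;; the context is a list of preceding categories; condition on the last
    (fn [context] (sample-categorical categories (dists (last context))))))
(defn make-observation
  "Sample one categorical distribution over words per category and return
   an observation function."
  []
  (let [dists (into {}
                    (map (fn [c]
                           [c (sample-dirichlet (repeat (count vocabulary) 1))])
                         categories))]
    (fn [category] (sample-categorical vocabulary (dists category)))))
(let [transition (make-transition)
      observation (make-observation)]
  (hmm-unfold transition observation
              2        ;; bigram order: contexts hold one category
              '(start) ;; initial context
              (transition '(start))
              (fn [c] (= c 'stop))))
Each run returns a list of [category word] pairs followed by the stop symbol.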
Our goal is to represent the posterior distribution over sequences of categories for an HMM/FSA.
\[\Pr(c^{(1)},\cdots,c^{(k)} \mid w^{(1)}, \cdots, w^{(k)})=\frac{\prod_{i=1}^k \Pr(w^{(i)} \mid c^{(i)}) \Pr(c^{(i)} \mid c^{(i-1)})}{\sum_{c^{\prime(1)},\cdots,c^{\prime(k)}}\prod_{i=1}^k \Pr(w^{(i)} \mid c^{\prime(i)}) \Pr(c^{\prime(i)} \mid c^{\prime(i-1)})}\]Recall that we can represent a distribution either as a sampler or a scorer. In this unit, we will look at how we can implement efficient samplers and scorers for this posterior distribution: algorithms that avoid the exponential blowup that arises if we try to represent this distribution in the naive way.
The denominator of the expression above is just the forward probability of the whole string. However, as before, this expression does not suggest an efficient way to represent the distribution. In particular, it suggests no way to sample a sequence other than to enumerate all sequences, score each one, and then renormalize using the forward probability. Since there are exponentially many sequences, this is not a practical way to sample from the posterior.
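To make this concrete, here is a sketch of the naive approach: enumerate every category sequence of the right length, score each one jointly with the string, and renormalize. The helper all-sequences and the score-sequence argument, which is assumed to return the log joint probability \(\log \Pr(c^{(1)},\cdots,c^{(k)},w^{(1)},\cdots,w^{(k)})\), are hypothetical names introduced only for this illustration.
;; Naive posterior sampling, for illustration only: the number of
;; sequences enumerated is |categories|^k.
(defn all-sequences
  "Enumerate every length-k sequence of states."
  [k states]
  (if (= k 0)
    '(())
    (mapcat (fn [c]
              (map (fn [tail] (cons c tail))
                   (all-sequences (- k 1) states)))
            states)))
(defn naive-posterior-sample
  "Sample a category sequence given words, by exhaustive enumeration.
   score-sequence is assumed to return log2 Pr(categories, words)."
  [words states score-sequence]
  (let [seqs (all-sequences (count words) states)
        log-joints (map (fn [cs] (score-sequence cs words)) seqs)
        log-z (apply logsumexp log-joints)]
    ;; renormalize in log space, then sample
    (sample-categorical seqs
                        (map (fn [lj] (Math/pow 2 (- lj log-z)))
                             log-joints))))
For the six categories above and a ten-word string, this already enumerates \(6^{10} \approx 6 \times 10^7\) sequences.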
Can we do better?
Let’s consider the posterior probability of an arbitrary state at time step \(k\). Let’s assume that we know that the state at time step \(k+1\) is equal to some particular state \(c^{(k+1)}\). We will consider the conditional distribution over states at time step \(k\), given the state \(c^{(k+1)}\) at time step \(k+1\) and the string. In other words,
\[\Pr(c^{(k)} \mid c^{(k+1)}, w^{(1)}, \cdots, w^{(n)})\]First, let’s consider the joint distribution,
\[\gamma_{k}(c^{(k)}, c^{(k+1)}) = \Pr(c^{(k)}, c^{(k+1)}, w^{(1)}, \cdots, w^{(n)})\]This is equal to
\[\mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, c^{(k)})\Pr(c^{(k+1)}\mid c^{(k)})\Pr(w^{(k+1)}\mid c^{(k+1)})\mathbf{bk}(c^{(k+1)}, w^{(k+2)},\cdots,w^{(n)})\]which follows from the chain rule together with the Markov assumptions of the HMM: given \(c^{(k)}\), the prefix of the string is independent of everything later, and given \(c^{(k+1)}\), the suffix is independent of everything earlier. By the definition of conditional probability,
\[\Pr(c^{(k)} \mid c^{(k+1)}, w^{(1)}, \cdots, w^{(n)}) = \frac{\Pr(c^{(k)}, c^{(k+1)}, w^{(1)}, \cdots, w^{(n)})}{\sum_{\bar{c}^{(k)}} \Pr(\bar{c}^{(k)}, c^{(k+1)}, w^{(1)}, \cdots, w^{(n)})}\]Substituting in the expression above,
\[\Pr(c^{(k)} \mid c^{(k+1)}, w^{(1)}, \cdots, w^{(n)}) = \frac{\mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, c^{(k)})\Pr(c^{(k+1)}\mid c^{(k)}) \Pr( w^{(k+1)} \mid c^{(k+1)}) \mathbf{bk}(c^{(k+1)}, w^{(k+2)},\cdots,w^{(n)})}{\sum_{\bar{c}^{(k)}} \mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, \bar{c}^{(k)})\Pr(c^{(k+1)}\mid \bar{c}^{(k)}) \Pr( w^{(k+1)} \mid c^{(k+1)}) \mathbf{bk}(c^{(k+1)}, w^{(k+2)},\cdots,w^{(n)})}\]Finally, note that \(\Pr( w^{(k+1)} \mid c^{(k+1)})\) and \(\mathbf{bk}(c^{(k+1)}, w^{(k+2)},\cdots,w^{(n)})\) are constant in the expression above (they do not depend on \(c^{(k)}\)), so they cancel.
\[\Pr(c^{(k)} \mid c^{(k+1)}, w^{(1)}, \cdots, w^{(n)}) = \frac{\mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, c^{(k)})\Pr(c^{(k+1)}\mid c^{(k)}) }{\sum_{\bar{c}^{(k)}} \mathbf{fw}(w^{(1)}, \cdots, w^{(k)}, \bar{c}^{(k)})\Pr(c^{(k+1)}\mid \bar{c}^{(k)})}\]Note what we have found here: we can compute the exact probability of each preceding state \(c^{(k)}\) given a specific following state \(c^{(k+1)}\) using only the forward probabilities and the transition probabilities. Thus, once we have computed the trellis of forward probabilities, we can use it to sample a complete path from the posterior distribution, working backward from the end of the string.
This algorithm is known as forward filtering, backward sampling.
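To sketch how this works in code, suppose we have a function log-fw such that (log-fw k c) returns the log forward probability \(\log \mathbf{fw}(w^{(1)},\cdots,w^{(k)},c)\) from a precomputed trellis, and a function log-transition returning \(\log \Pr(c^{(k+1)} \mid c^{(k)})\). Neither is defined in this section; both names are assumptions standing in for the forward algorithm of the previous units. All log probabilities are base 2, matching logsumexp above.
;; Forward filtering, backward sampling: a minimal sketch. log-fw and
;; log-transition are assumed to exist (the trellis would be built by the
;; forward algorithm); all log probabilities are base 2.
(defn sample-previous-state
  "Sample the state at position k given the sampled state at position k+1."
  [log-fw log-transition states k next-state]
  (let [log-weights (map (fn [c]
                           ;; fw(w(1..k), c) * Pr(next-state | c), in log space
                           (+ (log-fw k c)
                              (log-transition c next-state)))
                         states)
        log-z (apply logsumexp log-weights)]
    ;; renormalize in log space, then sample
    (sample-categorical states
                        (map (fn [lw] (Math/pow 2 (- lw log-z)))
                             log-weights))))
(defn ffbs
  "Sample a state sequence of length n from the posterior, walking
   backward from a sampled final state."
  [log-fw log-transition states n final-state]
  (loop [k (- n 1)
         next-state final-state
         path (list final-state)]
    (if (< k 1)
      path
      (let [c (sample-previous-state log-fw log-transition
                                     states k next-state)]
        (recur (- k 1) c (cons c path))))))
The final state \(c^{(n)}\) would itself be sampled in proportion to \(\mathbf{fw}(w^{(1)},\cdots,w^{(n)},c^{(n)})\). Each backward step renormalizes only as many weights as there are states, so once the trellis is built, sampling a whole sequence is linear in its length rather than exponential.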