One major issue with the bag-of-categories model is that we have once again given up our ability to model sequential dependencies. How can we fix this? One way is to imagine that we have an $n$-gram model over the categories. Since another name for an $n$-gram model is a Markov model, models with an $n$-gram model of categories are known as hidden Markov models (HMMs). This is because they make the Markov assumption, which says that the symbol at each position depends only on the preceding \(n-1\) symbols, except that they make this assumption on the latent or hidden variables representing the categories. HMMs are among the most important models in all of science: they have a very well-developed theory and are used in thousands of practical applications. In squiggle notation, an HMM can be expressed by the following lines.

\[\begin{align} \{\vec{\theta}_{T,c}\} &\sim \mathrm{Dirichlet}(\vec{\alpha}_{T})\\ \{\vec{\theta}_{O,c}\} &\sim \mathrm{Dirichlet}(\vec{\alpha}_{O})\\ c^{(i)} \mid c^{(i-1)} &\sim \mathrm{categorical}(\vec{\theta}_{T, c^{(i-1)}})\\ w^{(i)} \mid c^{(i)} &\sim \mathrm{categorical}(\vec{\theta}_{O, c^{(i)}}) \end{align}\]

Or in probability notation.

\[\begin{align} &\Pr(w^{(1)},\dots,w^{(k)},c^{(1)},\cdots,c^{(k)} , \{\vec{\theta}_{T,c}\}, \{\vec{\theta}_{O,c}\} )\\ =& \Pr(\{\vec{\theta}_{T,c}\} \mid \vec{\alpha}_{T}) \Pr(\{\vec{\theta}_{O,c}\} \mid \vec{\alpha}_{O}) \prod_{i=1}^k \Pr(w^{(i)} , c^{(i)} \mid c^{(i-1)}, \vec{\theta}_{O, c^{(i)}}, \vec{\theta}_{T, c^{(i-1)}})\\ =& \Pr(\{\vec{\theta}_{T,c}\} \mid \vec{\alpha}_{T}) \Pr(\{\vec{\theta}_{O,c}\} \mid \vec{\alpha}_{O}) \prod_{i=1}^k \Pr(w^{(i)} \mid c^{(i)}, \vec{\theta}_{O, c^{(i)}} ) \Pr(c^{(i)} \mid c^{(i-1)}, \vec{\theta}_{T, c^{(i-1)}}) \end{align}\]

Because they are so common and well-studied, HMMs come with their own special terminology. In HMMs, the categories are called states. The distribution over next states given the preceding state is called the transition distribution, and the distribution over words given a state is called the observation distribution (the same terminology we used in Latent Structure).

Implementing HMMs

Implementing HMMs using memoization is easy.

Start by defining each of the individual components of the model. This is very similar to the bag-of-categories model, but now the category at each step depends on the category at the previous step.

(def categories '(N V Adj Adv P stop))

(def vocabulary '(Call me Ishmael))

(def category->transition-probs
  ;; Memoized so that each category is associated with a single, persistent
  ;; vector of transition probabilities over the categories.
  (memoize (fn [category]
             (sample-dirichlet
              (repeat (count categories) 1)))))

(defn sample-category [preceding-category]
  (sample-categorical
   categories
   (category->transition-probs preceding-category)))

(def category->observation-probs
  ;; Memoized so that each category is associated with a single, persistent
  ;; vector of observation probabilities over the vocabulary.
  (memoize (fn [category]
             (sample-dirichlet
              (repeat (count vocabulary) 1)))))

(defn sample-observation [category]
  (sample-categorical
   vocabulary
   (category->observation-probs category)))
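
Note what the memoization buys us: the transition and observation parameters for a given category are sampled only once and then reused on every subsequent call. A quick sanity check (a minimal sketch using only the definitions above):

;; Because category->transition-probs is memoized, repeated calls with the
;; same category return the same cached parameter vector, so this is true.
(= (category->transition-probs 'N)
   (category->transition-probs 'N))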

Then we use these components to define the sampler for the joint distribution over sequences of words and categories, conditioned on a starting category.

(defn sample-categories-words [preceding-category]
  ;; Sample the next category given the preceding one; halt on stop, otherwise
  ;; emit a [category word] pair and recurse.
  (let [c (sample-category preceding-category)]
    (if (= c 'stop)
      '()
      (cons
       [c (sample-observation c)]
       (sample-categories-words c)))))

(sample-categories-words 'start)

Scoring HMMs

Once again, as was the case for our Dirichlet-categorical bag-of-categories models, scoring is easy if we have access to both the words of a sentence and the categories of each word. Again, this is very similar to scoring from the categorical bag-of-categories model, except that now each category depends on the previous category.

(defn score-category-word [preceding-category current-category word]
  ;; Log probability of transitioning to current-category and then emitting word.
  (+
   (score-categorical current-category categories
                      (category->transition-probs preceding-category))
   (score-categorical word vocabulary
                      (category->observation-probs current-category))))

(defn get-cat [s] (first s))
(defn get-word [s] (second s))

(defn score-categories-words [preceding-category cws]
  ;; Sum the log probabilities of the [category word] pairs, threading each
  ;; category through as the preceding category for the next step.
  (if (empty? cws)
    0
    (let [c (get-cat (first cws))
          w (get-word (first cws))]
      (+ (score-category-word preceding-category c w)
         (score-categories-words c (rest cws))))))

(score-categories-words 'start '([N me] [V Call]))

This gives us the log of the probability

\[\Pr(w^{(1)},\dots,w^{(k)},c^{(1)},\cdots,c^{(k)} \mid \{\vec{\theta}_{T,c}\}, \{\vec{\theta}_{O,c}\} )\]
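
Since the scorer works in log space, exponentiating the returned value recovers the probability itself. As a small sketch (reusing the call from above, and assuming standard Clojure, where Math/exp is available):

;; Exponentiate the log score to recover the joint probability of the
;; words and categories given the parameters.
(Math/exp (score-categories-words 'start '([N me] [V Call])))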

However, if we do not have access to the categories of the individual words, scoring becomes significantly more challenging: we must marginalize over all possible category sequences, as in the following expression. In the next unit, we examine this problem in more detail.

\[\begin{multline} \Pr(w^{(1)},\dots,w^{(k)}\mid \{\vec{\theta}_{T,c}\}, \{\vec{\theta}_{O,c}\} )=\\ \quad \quad \sum_{c^{(1)}\in \mathcal{S}} \cdots \sum_{c^{(k)}\in \mathcal{S}} \Pr(w^{(1)},\dots,w^{(k)},c^{(1)},\cdots,c^{(k)} \mid \{\vec{\theta}_{T,c}\}, \{\vec{\theta}_{O,c}\} ) \end{multline}\]
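
To make the difficulty concrete, here is a brute-force sketch of this sum. It reuses categories and score-categories-words from above and introduces two hypothetical helpers, logsumexp and all-category-sequences, which are not part of the code above. Because it enumerates every possible category sequence for the observed words, the number of terms grows exponentially in the sentence length.

;; Numerically stable log(sum(exp ...)) over a collection of log probabilities.
(defn logsumexp [log-vals]
  (let [mx (apply max log-vals)]
    (+ mx (Math/log (reduce + (map #(Math/exp (- % mx)) log-vals))))))

;; Every length-k sequence of non-stop categories (generation halts at stop).
(defn all-category-sequences [k]
  (if (zero? k)
    '(())
    (for [c (remove #{'stop} categories)
          cs (all-category-sequences (dec k))]
      (cons c cs))))

;; Sum the joint score over every category sequence for the observed words.
(defn score-words-brute-force [words]
  (logsumexp
   (for [cs (all-category-sequences (count words))]
     (score-categories-words 'start (map vector cs words)))))

(score-words-brute-force '(Call me))

Even for this tiny example the sum already has 5 × 5 = 25 terms; for realistic tag sets and sentence lengths, direct enumeration is hopeless, which is the problem the next unit takes up.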
