N-gram models are fundamentally unable to capture certain kinds of linguistic structure. In particular, they can only capture bounded dependencies, that is, dependencies between words within some fixed distance of each other. It is this property that gave rise to the abstract characterization of strictly local languages in terms of suffix substitution closure.

As we have seen, however, natural languages contain unbounded dependencies. Whether or not a string is grammatical may depend on contingencies between parts of the string that are arbitrarily far apart. For example, the language \(L_{\leq 1b}\) requires a mechanism that can remember whether a \(b\) has occurred earlier in the string in order to check that no other \(b\) appears anywhere else.
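
To make the memory requirement concrete, here is a small, purely illustrative Clojure sketch (assuming, as in the earlier lecture, that \(L_{\leq 1b}\) is the set of strings containing at most one \(b\)): a membership check must keep track of whether a \(b\) has been seen across the entire string, however long it is, which no fixed-size n-gram window can do.

;; Hypothetical membership check for L_{<=1b}: count the b's over the whole
;; string and require at most one. The relevant information (whether a b has
;; occurred) must be carried across an unbounded distance.
(defn in-L-at-most-one-b? [symbols]
  (<= (count (filter #(= % 'b) symbols)) 1))

(in-L-at-most-one-b? '(a a b a a)) ;; => true
(in-L-at-most-one-b? '(a b a a b)) ;; => false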

In order to capture unbounded dependencies, and other aspects of linguistic structure, we will introduce what are known as latent or hidden random variables into our models; that is, random variables whose values are not directly observed. In fact, we have already seen several examples of latent variables in earlier lectures: the variable \(\theta\) in our definition of the Dirichlet-categorical distribution, as well as the variables \(\theta_{\mathbf{c}}\) that parameterize the distributions over next symbols in our Dirichlet-categorical n-gram model. In these models, the distributions over vocabulary items are unobserved, and we had to infer their values via Bayesian updating or the principle of maximum likelihood. However, these hidden variables were global and continuous parameters: they were fixed once for the whole model by sampling from a continuous probability distribution (although in the n-gram case their values were sampled lazily). Now we will add a different kind of latent variable to our models, one which is discrete and local to individual sentences: word categories.

Adding Word Categories

When we discussed the limitations of the categorical BOW model, we noted the dependencies between words in the sentence

Colorless green ideas sleep furiously.

and how these helped to distinguish this sentence from the sentence

\(^*\)Furiously sleep ideas green colorless.

For example, we noted that green is an adjective which modifies the noun ideas. This description makes use of a certain kind of latent structure: whether a word is an adjective, a noun, or a member of some other grammatical category (also called a lexical category or part of speech). This kind of information is, of course, crucial in describing the structure of sentences: all models of grammar make the assumption that words can be grouped into categories which describe their grammatical behavior. So, as the next step in our process of building more accurate models of the grammatical and ungrammatical sentences of a language, let's add word categories.

Bags-of-Categories/Mixture Models

Perhaps the simplest way to add category information is simply to assume that the lexicon is divided into categories, each of which has its own distribution over words. In order to sample from such a model, we first sample a category for each word, and then sample a word from the distribution over words given that category. If \(\mathcal{S}\) is a set of categories, and we further assume Dirichlet priors over all of the relevant distributions, our model becomes:

\[\begin{align} \vec{\theta}_{\mathcal{S}} &\sim \mathrm{Dirichlet}(\vec{\alpha}_{\mathcal{S}})\\ \{\vec{\theta}_{c} \}_{c \in \mathcal{S}} &\sim \mathrm{Dirichlet}(\vec{\alpha})\\ c^{(i)} &\sim \mathrm{categorical}(\vec{\theta}_{\mathcal{S}})\\ w^{(i)} \mid c^{(i)} &\sim \mathrm{categorical}(\vec{\theta}_{c^{(i)}}) \end{align}\]

In such a model, the distributions over words \(\mathrm{categorical}(\vec{\theta}_{c^{(i)}})\) are often called observation distributions/models or emission distributions/models since they generate (or emit) the observed words.

In probability notation the conditional independencies of this model can be expressed as:

\[\begin{align} &\Pr(w^{(1)},\dots,w^{(k)},c^{(1)},\cdots,c^{(k)} , \{\vec{\theta}_{c}\} ,\vec{\theta}_{\mathcal S}\mid \vec{\alpha},\vec{\alpha}_{\mathcal S})\\ &= \Pr(\{\vec{\theta}_{c}\}\mid \vec{\alpha}) \Pr(\vec{\theta}_{\mathcal S}\mid \vec{\alpha}_{\mathcal S}) \Pr(w^{(1)},\dots,w^{(k)},c^{(1)},\cdots,c^{(k)} \mid \{\vec{\theta}_{c} \},\vec{\theta}_{\mathcal S})\\ &= \Pr(\{\vec{\theta}_{c}\}\mid \vec{\alpha}) \Pr(\vec{\theta}_{\mathcal S}\mid \vec{\alpha}_{\mathcal S}) \prod_{i=1}^k \Pr(w^{(i)},c^{(i)} \mid \vec{\theta}_{c^{(i)}} ,\vec{\theta}_{\mathcal S})\\ &= \Pr(\{\vec{\theta}_{c}\}\mid \vec{\alpha}) \Pr(\vec{\theta}_{\mathcal S}\mid \vec{\alpha}_{\mathcal S}) \prod_{i=1}^k \Pr(w^{(i)} \mid c^{(i)} , \vec{\theta}_{c^{(i)}} ) \Pr(c^{(i)} \mid \vec{\theta}_{\mathcal{S}}) \end{align}\]

Let’s express this model in Clojure.

First, define each of the individual components of the model.

(def categories '(N V Adj Adv P stop))

(def vocabulary '(Call me Ishmael))

;; the distribution over categories (theta_S), drawn from a symmetric Dirichlet prior
(def category-probs
  (sample-dirichlet
   (repeat (count categories) 1)))

(defn sample-category []
  (sample-categorical
   categories
   category-probs))

;; each category gets its own (memoized) distribution over vocabulary items
(def category->observation-probs
  (memoize (fn [category]
             (sample-dirichlet
              (repeat (count vocabulary) 1)))))

(defn sample-observation [category]
  (sample-categorical
   vocabulary
   (category->observation-probs category)))

Then we can use these components to define the sampler for the joint distribution over sequences of words and categories.

(defn sample-categories-words []
  (let [cat (sample-category)]
    (if (= cat 'stop)
      '()
      (cons
       [cat (sample-observation cat)]
       (sample-categories-words)))))

(sample-categories-words)

Typically, we wish to define a generative model that outputs just the observed words, and not their categories—because observed word sequences in real corpora do not come labeled with category information!

In other words, we wish to define the marginal distribution over words. Setting aside the prior distributions over \(\{\vec{\theta}_{c} \}_{c \in \mathcal{S}}\) and \(\vec{\theta}_{\mathcal{S}}\) for the moment, we can write this marginal as:

\[\Pr(w^{(1)},\dots,w^{(k)} \mid \{\vec{\theta}_{c} \}, \vec{\theta}_{\mathcal{S}})\]

When we sample, we can draw from this marginal distribution by simply ignoring (or forgetting) the particular categories sampled to produce each word.

(defn sample-words []
  (let [cat (sample-category)]
    (if (= cat 'stop)
      '()
      (cons (sample-observation cat)
            (sample-words)))))

(sample-words)

How can we score such a model?

Scoring a Bag-of-Categories Model

As always, we would like to define a scoring function for our model, but now we have a problem. If we are given a list of words together with their categories, as generated by sample-categories-words, scoring is relatively straightforward. Note that, as usual, we work in the log domain here to avoid underflow.

(defn score-category-word [c w]
  (+
   (score-categorical c categories category-probs)
   (score-categorical w vocabulary 
                      (category->observation-probs c))))

(defn get-word [cw] (second cw))
(defn get-cat [cw] (first cw))

(defn score-categories-words [cws]
  (if (empty? cws)
    0.0
    (+
     (score-category-word (get-cat (first cws)) (get-word (first cws)))
     (score-categories-words (rest cws)))))

(score-categories-words '([V Call] [N me] [N Ishmael]))

This, however, gives us the joint probability of the words together with their categories:

\[\Pr(w^{(1)},\dots,w^{(k)},c^{(1)},\cdots,c^{(k)} \mid \{\vec{\theta}_{c} \}, \vec{\theta}_{\mathcal{S}})\]

By analogy with our earlier scoring procedures, what we would like is a way to compute the probability of just the words.

\[\Pr(w^{(1)},\dots,w^{(k)} \mid \{\vec{\theta}_{c} \}, \vec{\theta}_{\mathcal{S}})\]

In fact, to be completely precise, we want the probability of the given word sequence together with the final category being the stop symbol:

\[\Pr(w^{(1)},\dots,w^{(k)},C^{(k+1)}=\ltimes \mid \{\vec{\theta}_{c} \}, \vec{\theta}_{\mathcal{S}})\]

This represents the probability that the model generated this particular word sequence and then stopped.

To compute this, we must marginalize over all possible sequences of hidden categories that could have produced those words.

\[\begin{multline} \Pr(w^{(1)},\dots,w^{(k)},C^{(k+1)}=\ltimes \mid \{\vec{\theta}_{c} \}, \vec{\theta}_{\mathcal{S}})=\\ \quad \quad\sum_{c^{(1)}\in \mathcal{S}} \cdots \sum_{c^{(k)}\in \mathcal{S}} \Pr(w^{(1)},\dots,w^{(k)}, c^{(1)},\dots,c^{(k)},C^{(k+1)}=\ltimes \mid \{\vec{\theta}_{c} \}, \vec{\theta}_{\mathcal{S}}) \end{multline}\]

Substituting in the expression for our bag-of-categories model, we get:

\[\begin{multline} \Pr(w^{(1)},\dots,w^{(k)},C^{(k+1)}=\ltimes \mid \{\vec{\theta}_{c} \}, \vec{\theta}_{\mathcal{S}}) =\\ \quad \quad\Pr(C^{(k+1)}=\ltimes) \sum_{c^{(1)}\in \mathcal{S}} \cdots \sum_{c^{(k)}\in \mathcal{S}} \prod_{i=1}^k \Pr(w^{(i)} \mid c^{(i)}, \vec{\theta}_{c^{(i)}}) \Pr(c^{(i)} \mid \vec{\theta}_\mathcal{S}) \end{multline}\]

These outer sums range over every possible sequence of hidden categories. The number of such sequences is the number of hidden categories raised to the power of the length of the sentence; that is, the sums consider \(|\mathcal{S}|^k\) possible combinations of categories. This rapidly becomes intractable.
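
To see the problem concretely, here is a hypothetical brute-force scorer (not part of the development above) that implements this marginalization directly: it enumerates all \(|\mathcal{S}|^k\) category sequences, sums the joint probabilities computed by score-categories-words, and adds the final stop factor. For our three-word example this is only \(6^3 = 216\) terms, and its value should agree (up to floating-point error) with the efficient scorer we derive below, but the enumeration quickly becomes hopeless as sentences grow.

;; Hypothetical brute-force marginalization, for illustration only.
(defn all-category-sequences [k]
  ;; every length-k sequence of categories: |S|^k of them
  (if (zero? k)
    '(())
    (for [c categories
          rest-cs (all-category-sequences (dec k))]
      (cons c rest-cs))))

(defn score-words-naive [ws]
  ;; sum (in the probability domain) the joint score of the words paired with
  ;; every possible category sequence, then add the log probability of stopping
  (+ (Math/log
      (reduce +
              (map (fn [cs]
                     (Math/exp (score-categories-words (map vector cs ws))))
                   (all-category-sequences (count ws)))))
     (score-categorical 'stop categories category-probs)))

(score-words-naive '(Call me Ishmael))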

How can we deal with this problem? Notice that the choices of category at each position are independent, and the choice of word at a given position depends only on the category at that position. This means that when we choose the word at a given position, nothing about the choices made at other positions matters. As a result, we can pull out all of the terms that correspond to a single position \(i\) in the expression above.

\[\begin{align*} \sum_{c^{(1)} \in \mathcal{S}}\cdots \sum_{c^{(k)} \in \mathcal{S}}& \prod_{i=1}^k \Pr(w^{(i)} \mid c^{(i)}, \vec{\theta}_{c^{(i)}}) \Pr(c^{(i)}\mid\vec{\theta}_{\mathcal{S}})\\ &= \sum_{c^{(1)}\in \mathcal{S}} \left[ \Pr(w^{(1)}\mid c^{(1)},\vec{\theta}_{c^{(1)}}) \Pr(c^{(1)}\mid\vec{\theta}_{\mathcal{S}}) \left( \sum_{c^{(2)} \in \mathcal{S}}\cdots \sum_{c^{(k)} \in \mathcal{S}} \prod_{i=2}^k \Pr(w^{(i)} \mid c^{(i)}, \vec{\theta}_{c^{(i)}}) \Pr(c^{(i)}\mid\vec{\theta}_{\mathcal{S}}) \right) \right] \\ &= \left[ \sum_{c^{(2)} \in \mathcal{S}}\cdots \sum_{c^{(k)} \in \mathcal{S}} \prod_{i=2}^k \Pr(w^{(i)} \mid c^{(i)}, \vec{\theta}_{c^{(i)}}) \Pr(c^{(i)}\mid\vec{\theta}_{\mathcal{S}}) \right] \sum_{c^{(1)}\in \mathcal{S}} \Pr(w^{(1)}\mid c^{(1)},\vec{\theta}_{c^{(1)}}) \Pr(c^{(1)}\mid\vec{\theta}_{\mathcal{S}}) \end{align*}\]

Iterating this process \(k\) times, we see that the marginal probability of the sequence of words is given by:

\[\begin{multline} \Pr(w^{(1)},\dots,w^{(k)},C^{(k+1)}=\ltimes \mid \{\vec{\theta}_{c}\}, \vec{\theta}_{\mathcal{S}}) =\\ \quad \quad \Pr(C^{(k+1)}=\ltimes) \prod_{i=1}^k \sum_{c^{(i)} \in \mathcal{S}} \Pr(w^{(i)} \mid c^{(i)}, \vec{\theta}_{c^{(i)}}) \Pr(c^{(i)} \mid \vec{\theta}_\mathcal{S}) \end{multline}\]

Now, instead of a sum over an exponential number of category sequences, each term of which is a product over positions, we have a product over positions of a sum over the possible categories at each position.

This can be seen as a direct consequence of distributivity. For instance, if \(X\) and \(Y\) are two finite sets of numbers, \(\sum_{x \in X}\sum_{y \in Y} x y = \left(\sum_{x \in X} x\right) \left(\sum_{y \in Y} y\right)\). More generally, \(\sum_{x^{(1)} \in X_1}\cdots\sum_{x^{(N)} \in X_N} \prod_{n=1}^N x^{(n)} = \prod_{n=1}^N \sum_{x^{(n)} \in X_n} x^{(n)}\).
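
As a quick, purely illustrative sanity check of this identity, we can compare the two sides numerically for a pair of small sets:

(let [xs [1 2 3]
      ys [10 20]]
  ;; first element: the sum over all pairs of the product x*y
  ;; second element: the product of the two sums
  [(reduce + (for [x xs y ys] (* x y)))
   (* (reduce + xs) (reduce + ys))])
;; => [180 180]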

The fact that this is possible is most easily illustrated with an example. Imagine that we just want to look at the probability of the sequence me Ishmael, and that we only consider the categories noun and verb. The expression above becomes the following sum over the four possible category sequences.

\[\begin{align} \Pr(\textit{me}\,|N)\cdot\Pr(N)\quad\cdot&\quad\Pr(\textit{Ishmael}\,|V)\cdot\Pr(V)&& \quad + \\ \Pr(\textit{me}\,|N)\cdot\Pr(N)\quad\cdot&\quad\Pr(\textit{Ishmael}\,|N)\cdot\Pr(N)&& \quad+ \\ \Pr(\textit{me}\,|V)\cdot\Pr(V)\quad\cdot&\quad\Pr(\textit{Ishmael}\,|V)\cdot\Pr(V)&& \quad+ \\ \Pr(\textit{me}\,|V)\cdot\Pr(V)\quad\cdot&\quad\Pr(\textit{Ishmael}\,|N)\cdot\Pr(N)&& \end{align}\]

But in fact, this expression can be factored thanks to the conditional independence of the word positions.

\[\begin{align} \Pr(\textit{me}\,|N)\Pr(N)&\cdot[\Pr(\textit{Ishmael}\,|V)\Pr(V) + \Pr(\textit{Ishmael}\,|N)\Pr(N)]& + \\ \Pr(\textit{me}\,|V)\Pr(V)&\cdot[\Pr(\textit{Ishmael}\,|V)\Pr(V) + \Pr(\textit{Ishmael}\,|N)\Pr(N)]& \end{align}\]

Factoring it completely, we get:

\[\begin{align} &\big[\Pr(\textit{me}\,|N)\Pr(N) &+ &\Pr(\textit{me}\,|V)\Pr(V)&\big] &\cdot \\ &\big[\Pr(\textit{Ishmael}\,|N)\Pr(N) &+ &\Pr(\textit{Ishmael}\,|V)\Pr(V)&\big]& \end{align}\]

This is a much friendlier expression, since the number of terms it contains grows with the length of the string rather than with the number of possible category sequences.
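
The scorer below implements this computation in the log domain. It relies on a logsumexp helper, which computes the log of a sum of probabilities directly from their logarithms; we assume this was defined in an earlier lecture, but a minimal variadic sketch might look like the following.

(defn logsumexp [& log-vals]
  ;; log(exp(x1) + ... + exp(xn)), computed stably by factoring out the maximum
  (let [mx (apply max log-vals)]
    (+ mx
       (Math/log
        (reduce + (map #(Math/exp (- % mx)) log-vals))))))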

(defn score-words [ws]
  (if (empty? ws)
    ;; the Pr(C^(k+1) = stop) factor: the log probability of stopping here
    (score-categorical 'stop categories category-probs)
    (let [w (first ws)
          ;; log joint score of w with each possible category at this position
          joint-scores (map
                        (fn [c] (score-category-word c w))
                        categories)]
      (+
       ;; marginalize over the category at this position in the log domain
       (apply logsumexp joint-scores)
       (score-words (rest ws))))))

(score-words '(Call me Ishmael))


Topic Models

One obvious problem with such a model is that we have now given up our ability to encode sequential dependencies between words in different positions. Can you see why this is true? Nevertheless, bag-of-categories models have an important application in applied natural language processing: they form the basis of topic models. In topic models, the categories are called topics, and each document is assumed to have its own distribution over word categories, which is learned from data. Because of the simple nature of these models, working with them can be quite efficient, and they have many practical applications.
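
For concreteness, here is a hypothetical sketch of that idea in the style of the code above: each document, identified here by an arbitrary symbol such as doc1, receives its own memoized distribution over categories ("topics"), while the emission distributions are shared across documents. This sketches only the generative side; an actual topic model would also specify how the per-document distributions are inferred from data.

;; Hypothetical per-document category ("topic") distributions; the emission
;; distributions are shared with the model above.
(def document->category-probs
  (memoize (fn [doc]
             (sample-dirichlet
              (repeat (count categories) 1)))))

(defn sample-document-words [doc]
  (let [cat (sample-categorical categories (document->category-probs doc))]
    (if (= cat 'stop)
      '()
      (cons (sample-observation cat)
            (sample-document-words doc)))))

(sample-document-words 'doc1)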

