In the last units, we introduced languages defined by sets of $n$-grams or $n$-factors. The set of languages that can be defined in this way is called the strictly-local languages. Now we consider putting probability distributions on such languages. This gives rise to a foundational and widely used class of models known as $n$-gram models or Markov models.

Probabilistic N-gram Models

An $n$-gram or $n$-factor captures the idea that the \(n\)th word depends on the \((n-1)\) words that precede it. In a probabilistic setting, we can encode this dependence using conditional probability distributions. In particular, we will imagine each string is generated by choosing the next word conditioned on the preceding \(n-1\) words already chosen. In mathematical notation this can be written as follows.

\[\Pr(W^{(1)},\dots,W^{(l)}) =\prod_{i=1}^{l} \Pr(W^{(i)} \mid W^{(i-n+1)}=w^{(i-n+1)},\dots,W^{(i-1)}=w^{(i-1)})\]

We call the preceding \(n-1\) words a context. Note that in the model just defined, we must represent a different conditional distribution over next words for each context and each of these distributions is a distribution over the vocabulary \(V\). Thus, a natural way to define these distributions is to use a categorical distribution for each. In squiggle notation, we can write:

\[\begin{align} w^{(i)} \mid w^{(i-n+1)}, \dots, w^{(i-1)} &\sim&\mathrm{categorical}(\vec{\theta}_{w^{(i-n+1)}, \dots, w^{(i-1)}}) & ,\quad n \leq i \leq l\\ w^{(i)} \mid w^{(1)}, \dots, w^{(i-1)} &\sim& \mathrm{categorical}(\vec{\theta}_{\rtimes, w^{(1)}, \dots, w^{(i-1)}}) & ,\quad 1 \le i < n\\ \end{align}\]

or in the case of a bigram model, where \(n=2\):

\[\Pr(W^{(1)},\dots,W^{(l)})=\prod_{i=1}^{l} \Pr(W^{(i)} \mid W^{(i-1)}=w^{(i-1)})\] \[\begin{align} w^{(i)} \mid w^{(i-1)} &\sim&\mathrm{categorical}(\vec{\theta}_{w^{(i-1)}}) & ,\quad 2 \leq i \leq l \\ w^{(1)} &\sim& \mathrm{categorical}(\vec{\theta}_{\rtimes}) & \\ \end{align}\]

To simplify the notation, we will introduce a string-valued variable \(\mathbf{c}\) that ranges over contexts, allowing us to write:

\[w^{(i)} \mid \mathbf{c} \sim \mathrm{categorical}(\vec{\theta}_{\mathbf{c}})\]

for all of the cases above.

A critical observation about \(n\)-gram models is that there is a separate distribution \(\vec{\theta}_{\mathbf{c}}\) for each length-\((n-1)\) string \(\mathbf{c}\). In other words, the distributions are just a set of categorical distributions indexed by \(\mathbf{c}\). You can think of these parameter vectors as the elements of a hashtable whose keys are given by \(\mathbf{c}\).
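
To make the hashtable picture concrete, here is a minimal sketch of what the parameter vectors of a bigram model over a three-word vocabulary might look like as a Clojure map. The name theta and the probability values are made up purely for illustration.

;; Hypothetical bigram parameters: a map from contexts (length-1 strings)
;; to categorical parameter vectors over the vocabulary (call me Ishmael).
(def theta
  {'(call)    [0.1 0.7 0.2]
   '(me)      [0.2 0.1 0.7]
   '(Ishmael) [0.5 0.3 0.2]})

;; Look up the distribution for the context (call)
(get theta '(call))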

Where should the parameter vectors \(\vec{\theta}_{\mathbf{c}}\) come from? A natural solution is to draw each of them from a Dirichlet distribution.

\[\begin{align} \vec{\theta}_{\mathbf{c}} \mid \vec{\alpha} &\sim& \mathrm{Dirichlet}(\vec{\alpha}) \\ w^{(i)} \mid \mathbf{c} &\sim&\mathrm{categorical}(\vec{\theta}_{\mathbf{c}}) & \\ \end{align}\]

There are many possible contexts, and thus such a model has a potentially very large number of distributions and parameters. In particular, there are \(|V|^{n-1}\) different sequences that we might need to condition on. For instance, if we were using \(6\)-grams (admittedly a largish \(n\)-gram order) and had a vocabulary of just \(500\) words (a small vocabulary), there would be \(31,250,000,000,000\) possible contexts, and thus \(31,250,000,000,000\) different parameter vectors \(\vec{\theta}_{\mathbf{c}}\). As you can see, this number grows very large very quickly.
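
As a quick sanity check on this arithmetic, we can compute the number of contexts directly (using exact integer arithmetic):

;; Number of distinct contexts for a 6-gram model over a 500-word
;; vocabulary: |V|^(n-1) = 500^5
(apply * (repeat 5 500)) ;; => 31250000000000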

In this example, we draw all of the parameters for these different distributions from a Dirichlet distribution with the same parameter vector \(\vec{\alpha}\). In this case, we say that the parameter vector is shared or tied across the model.

Persistent Randomness with memoize

When we worked with the Dirichlet-categorical bag-of-words model, we only had a single draw from our Dirichlet distribution. As noted above, we are now in a starkly different setting, where we have an exponential number of categorical distributions that we need to parameterize. How can we deal with the fact that we want to store and remember so many draws from a Dirichlet?

One idea is that we can draw each vector \(\vec{\theta}_{\mathbf{c}}\) only when we first need it, since most contexts will never come up in practice. To see this, note that if we wanted all \(31,250,000,000,000\) contexts in our calculation above to occur even once, we would need a corpus which is at a minimum several times larger than all of the text on the internet (and that is only if it listed each possible $n$-gram once, with no repeats). In other words, for models with so many parameters, even very large datasets are necessarily sparse: most of the evidence you would need to see to estimate the probabilities for most contexts is missing from the sample.

One solution is to draw each \(\vec{\theta}_{\mathbf{c}}\) lazily, that is, just when we need it. We also want to avoid having to name a variable by hand for each of these draws. One very useful construct for achieving these things is memoization.

Memoization refers to a technique where we (i) intercept calls to a function on some arguments, that is, we intercept calls to \(f(a_1,\dots,a_k)\). We then (ii) check whether the function \(f\) has been called with those arguments \(a_1,\dots,a_k\) before and, if not, call the function on those arguments. Once we get the result \(y=f(a_1,\dots,a_k)\), we (iii) store \(y\) in a table associated with the key \(a_1,\dots,a_k\). And, finally, we (iv) return \(y\). On subsequent calls to the function with those arguments, we simply look up the result in the table and avoid recomputing the value.
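
As a sketch of how steps (i) through (iv) can be implemented, here is a hypothetical my-memoize that keeps its table in an atom (the built-in memoize introduced below does essentially this):

;; A minimal sketch of the memoization recipe above; my-memoize is a
;; hypothetical stand-in for Clojure's built-in memoize.
(defn my-memoize [f]
  (let [table (atom {})]              ;; the lookup table, keyed by argument lists
    (fn [& args]                      ;; (i) intercept calls to f on some arguments
      (if (contains? @table args)
        (get @table args)             ;; seen before: just look up the stored result
        (let [y (apply f args)]       ;; (ii) not seen before: call f on the arguments
          (swap! table assoc args y)  ;; (iii) store the result under this key
          y)))))                      ;; (iv) return the result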

Memoization is an important optimization technique in computer science and will play a fundamental role in the algorithms we study later in this course. Less widely known, however, is that memoization is also a technique for achieving what is sometimes called persistent randomness in probabilistic models.

In Clojure, there already exists a higher-order procedure, memoize, which takes another procedure and memoizes it.

;; flip returns true with probability weight and false otherwise
(defn flip [weight]
  (if (< (rand 1) weight)
      true
      false))

;; mem-flip caches the result of (flip 0.5) the first time it is called;
;; every later call with the same argument returns that same cached value.
(def mem-flip (memoize flip))
;; mem-flip-2 is a separately memoized copy of flip, so its cached value
;; is an independent draw and may differ from mem-flip's.
(def mem-flip-2 (memoize flip))

(list
 (mem-flip 0.5)   ; all four of these will be identical
 (mem-flip 0.5)
 (mem-flip 0.5)
 (mem-flip 0.5)
 (mem-flip-2 0.5) ; and all four of these will be identical to one another
 (mem-flip-2 0.5)
 (mem-flip-2 0.5)
 (mem-flip-2 0.5))

Hierarchical $n$-gram Models with Memoization

Using memoize, we can define hierarchical \(n\)-gram models in an elegant way. First, let’s define a function update-context which maintains the context that the Markov model is sampling from.

(defn update-context [n-gram-order old-context new-symbol]
  (if (>= (count old-context) n-gram-order)
    (throw (ex-info "Context too long!" {:context old-context}))
    (if (= (count old-context) (- n-gram-order 1))
      ;; context is already full: drop the oldest symbol and append the new one
      (concat (rest old-context) (list new-symbol))
      ;; context is still being built up: just append the new symbol
      (concat old-context (list new-symbol)))))
(list 
	(update-context 3 (list) 'a)
	(update-context 3 (list 'a) 'a)
	(update-context 3 (list 'a 'b) 'a))

The idea behind update-context is that it drops the first symbol at the beginning of the context and adds the new one to the end, maintaining a context of length \(n-1\). If the current context is shorter than \(n-1\), it first builds up a context of that length by adding symbols.
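
To see this rolling behavior across a whole sentence, we can trace the successive contexts with reductions (this trace is only for illustration and is not part of the model):

;; Successive trigram contexts as we read the sentence left to right
(reductions (partial update-context 3) '() '(call me Ishmael stop))
;; => (() (call) (call me) (me Ishmael) (Ishmael stop))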

Generating from an \(n\)-gram model

We can define a sampler for the \(n\)-gram model using a variant of the list-unfold function that we defined in Bag-of-Words Models, adapted to use $n$-gram contexts.

(defn list-unfold [generator n-gram-order context current stop?]
  (if (stop? current)
    (list current)
    (let [new-context (update-context n-gram-order context current)
          nxt (generator new-context)]
      (cons current
            (list-unfold generator
                         n-gram-order 
                         new-context
                         nxt
                         stop?)))))

(list-unfold
 (fn [context] (sample-categorical '(a b) '(0.5 0.5))) ; generator (ignores the context in this toy example)
 2                 ; n-gram-order
 '()               ; context
 'start            ; current
 (fn [w] (= w 'a)) ; stop?
)

Note that we have changed two things relative to the list-unfold defined earlier. First, our stopping criterion has changed from using a fixed length to checking whether a stop symbol has been generated. Second, we now generate based on a rolling context, which we maintain with our update-context function.

Now we will use memoization to sample a persistent distribution for each context which we can use multiple times afterwards.

(def vocabulary '(call me Ishmael stop))
(def alpha (repeat (count vocabulary) 1))

(def context->probs
  (memoize
   (fn [context]
     ;; one Dirichlet draw per context, sampled lazily the first time
     ;; that context is looked up
     (sample-dirichlet alpha))))

[vocabulary (context->probs '(call me))]
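
Because context->probs is memoized, looking up the same context again returns exactly the same parameter vector:

;; The same context always maps back to the same lazily sampled theta vector
(= (context->probs '(call me))
   (context->probs '(call me)))
;; => true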

Finally, we define a generator and a stopping-condition function and put it all together.

(def n-gram-order 3)

(defn stop? [word]
  (= word 'stop))

(defn generator [context]
  (sample-categorical
   vocabulary
   (context->probs context)))

(list-unfold generator n-gram-order '() 'start stop?)
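
Since the draws made by context->probs persist, we can generate several sentences that share the same underlying parameter vectors (using repeatedly here, which is not part of the example above):

;; Three sentences generated from the same set of memoized theta vectors
(repeatedly 3 #(list-unfold generator n-gram-order '() 'start stop?))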

Scoring an \(n\)-gram model

What about scoring an \(n\)-gram model? First, let’s introduce a scorer that computes the (log) probability of a single word given its context: $\log \Pr( w^{(i)} \mid \mathbf{c} )$.

;; java.lang.Math has no log2 method, so we define log base 2 via natural logs
;; (a helper like this may also be provided by the course library)
(defn log2 [x]
  (/ (Math/log x) (Math/log 2)))

(defn log-scorer [context word]
  (log2
   (score-categorical
    word
    vocabulary
    (context->probs context))))

(log-scorer 
  (list 'start 'call) 
  'me)

To score a whole sentence, we accumulate the log probability of each word given the preceding $n-1$ words of context.
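
In log space, this accumulation is just a sum of the per-word log scores (a restatement of the product rule from above):

\[\log \Pr(w^{(1)},\dots,w^{(l)}) = \sum_{i=1}^{l} \log \Pr(w^{(i)} \mid \mathbf{c}^{(i)})\]

where \(\mathbf{c}^{(i)}\) is the (at most \(n-1\) word) context preceding \(w^{(i)}\).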

We can compute this using list-foldr, adjusted to handle contexts in its updates.

(defn list-foldr
  [accumulator base scorer n-gram-order context lst]
  (if (empty? lst)
    base
    (let [new-context
          (update-context n-gram-order context (first lst))]
      (accumulator
       ;; score the current word in the current context
       (scorer context (first lst))
       ;; recurse on the rest of the list with the rolled-forward context
       (list-foldr
        accumulator base scorer n-gram-order new-context (rest lst))))))

(list-foldr 
 +          ; accumulator
 0          ; base
 log-scorer ; scorer
 3          ; n-gram-order
 '(start)   ; context
 '(call me Ishmael stop) 
 )
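
Since log-scorer returns base-2 log probabilities (given the log2 helper above), exponentiating the accumulated score recovers the probability of the sentence under the sampled parameters:

;; Recover the sentence probability from the accumulated base-2 log score
(Math/pow 2
          (list-foldr + 0 log-scorer 3 '(start) '(call me Ishmael stop)))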
