In Distributions over Languages, we showed how to use samplers and scorers to define distributions over infinite formal languages and how to use these, in turn, to define samplers and scorers for corpora. However, in all of those examples we worked with a small, fixed vocabulary of just \(\{a,b\}\).

How can we define samplers and scorers for a real document, like the text of Moby Dick? Whatever our model, it must at least put probability mass on all of the words that appear in the book, i.e., if it assigned probability \(0\) to the word Ishmael, then the probability of the whole corpus would also be \(0\). Now that we have introduced the categorical distribution, we can begin to define models over arbitrary vocabularies containing any set of words.

One very simple way to use the categorical distribution to model a text (maybe the simplest possible) is called the bag of words (BOW) approach. Under this approach, we assume that each word is generated independently (i.e., separately) from some distribution without reference to the other words. Since there are no dependencies between words, we call the result a simple “bag” of words. Equipped with the categorical distribution, we can easily define a simple bag of words model for sentences.

(def vocabulary (list 'call 'me 'Ishmael))
(def probabilities (list (/ 1 3) (/ 1 3) (/ 1 3)))

(defn sample-BOW-sentence [len]
  (if (= len 0)
    '()
    (cons (sample-categorical vocabulary probabilities)
          (sample-BOW-sentence (- len 1)))))

(sample-BOW-sentence 10)

Generating General Sequences: list-unfold

Notice the pattern used to define sample-BOW-sentence: we recursed on the length of the output sequence. This function is a special case of a general pattern known as a (list) unfold, which looks like this.

(defn list-unfold [generator len]
  (if (= len 0)
    '()
    (cons (generator)
          (list-unfold generator (- len 1)))))

(defn sample-BOW-sentence [len]
  (list-unfold (fn []
                 (sample-categorical
                  vocabulary
                  probabilities))
               len))

(sample-BOW-sentence 3)

List unfolds, generally speaking, take two pieces of information: (i) a generator, implemented as a thunk, for individual elements of a list, and (ii) some information or criterion that tells the unfold when to stop generating. We will see several instances of this general pattern.
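For example, the stopping criterion need not be a fixed length. Here is a minimal sketch (not used in the rest of this chapter; the name list-unfold-geometric is ours) of an unfold that decides randomly when to stop, reusing sample-categorical and the vocabulary and probabilities defined above.

(defn list-unfold-geometric [generator stop-prob]
  ;; Before generating each element, stop with probability stop-prob,
  ;; so the resulting lengths follow a geometric distribution
  ;; (possibly zero).
  (if (< (rand) stop-prob)
    '()
    (cons (generator)
          (list-unfold-geometric generator stop-prob))))

(list-unfold-geometric
 (fn [] (sample-categorical vocabulary probabilities))
 0.25)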

Scoring a Sequence: list-foldr

An unfold operation is a generator for structured objects like lists. A fold is the inverse operation, which takes a structured object such as a list and reduces it to a single value at the end. As such, we can use list-foldr to implement a scorer for our lists. The r in the name of the function refers to the fact that this is what is known as a right fold. This just means that the recursion first goes to the end of the list before beginning to return—in linguistic terms, this means that the function’s evaluation tree is right-branching.

Each random variable representing a word \(W^{(i)}\) in a BOW model is generated independently. In other words, the distribution over the values each \(W^{(i)}\) can take on is not influenced by the values of any other word. The product rule of probability says that the joint probability of two independent random variables is the product of the probabilities of each random variable in isolation. That is, if call occurs one third of the time, and me occurs one third of the time, then call me occurs one ninth of the time among all sequences of two words under the BOW model.
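In symbols, the two-word example works out as:

\[\Pr(\textit{call me}) = \Pr(\textit{call}) \cdot \Pr(\textit{me}) = \tfrac{1}{3} \cdot \tfrac{1}{3} = \tfrac{1}{9}\]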

(def vocabulary '(call me Ishmael))
(def probabilities (list (/ 1 3) (/ 1 3) (/ 1 3)))

(defn list-foldr [f base lst]
  (if (empty? lst)
      base
      (f (first lst)
         (list-foldr f base (rest lst)))))

(defn score-BOW-sentence [sentence]
  (list-foldr
   (fn [word rest-score]
     (* (score-categorical word vocabulary probabilities)
        rest-score))
   1
   sentence))

(score-BOW-sentence '(call me Ishmael))
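To see the right-branching structure of the fold at work, the call above unwinds schematically as shown below (depending on the environment, the result may print as an exact ratio or as a decimal).

;; (score-BOW-sentence '(call me Ishmael))
;; => (* 1/3 (* 1/3 (* 1/3 1)))
;; => 1/27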

Under a BOW model, each word is sampled independently from the same categorical distribution, so the probability of a string \(w^{(1)}w^{(2)}\ldots w^{(k)}\) is just the product of the probabilities of each of the words used in that string. In other words:

\[W^{(i)}\overset{\rm iid}\sim \operatorname{Categorical}(\vec{\theta}) \qquad \forall i\in\{1,\ldots,k\}\] \[\Pr(W^{(1)}, \dots, W^{(k)} = w^{(1)}, \dots ,w^{(k)}\mid \vec{\theta}) = \prod^k_{i=1} \theta_{w^{(i)}}\]

We are again using \(W^{(i)}\) to refer to the random variable which represents the distribution over the \(i\)th word in a sequence. We use \(W^{(i)} = w^{(i)}\) to refer to the event or outcome that the \(i\)th word takes on the value \(w^{(i)}\).

Let’s take a moment to note our indexing conventions. We will always use superscript parenthesis notation \(W^{(i)}\) to refer to a random variable or value that appears at position \(i\) in a sequence. We will use subscript notation to refer to particular outcomes in a distribution, such as words in the vocabulary. So, \(w_j\) means the \(j\)th word in our list of vocabulary items (under some indexing scheme). Thus, while \(W^{(i)}=w^{(i)}\) tells us just that the \(i\)th word in a sequence was \(w^{(i)}\), \(W^{(i)}=w_{j}\) tells us that the \(i\)th sequential word was the \(j\)th vocabulary word (e.g., the word Ishmael). In the expression above, we have made use of \(\theta_{w^{(i)}}\) to refer to the component of the vocabulary distribution \(\vec{\theta}\) corresponding to word \(w^{(i)}\); this just picks out the element of the parameter vector which holds the probability of word \(w^{(i)}\).

Our distinction between distributions and random variables is clearly important in understanding BOW models. In a BOW model, \(W^{(i)}\) has the same distribution for every \(i\), but clearly these aren’t the same random variable!

Note also that the expression

\[\Pr(W^{(1)},\dots, W^{(k)} = w^{(1)},\dots,w^{(k)}\mid \vec{\theta})\]

is called the conditional probability that the random variables \(W^{(1)}, \dots, W^{(k)}\) take on the values \(w^{(1)}, \dots ,w^{(k)}\) given that they were sampled from a categorical distribution with parameter vector \(\vec{\theta}\). The notation \(\Pr( x \mid y )\) is interpreted as the probability that proposition \(x\) is true when the condition \(y\) is true. Since it is usually clear what we mean from context, we will often write the simpler version.

\[\Pr(w^{(1)}, \dots ,w^{(k)}\mid \vec{\theta})\]

The expression \(\prod^k_{i=1} \theta_{w^{(i)}}\) represents the probability of a sentence under the bag-of-words model and very closely reflects the way we think about the generative process since it accumulates probability as each word \(w^{(i)}\) is sampled. However, there is a very important alternative way of writing the probability of a sentence under the BOW model based on counts. Since multiplication is commutative and associative, we can gather up all of the terms that refer to any copies of the same word in a sentence. If \(n_w\) is the count of word (type) \(w\) from the vocabulary \(V\) in the sentence, then the probability of the sentence can be written

\[\Pr(w^{(1)},\dots,w^{(k)}\mid \vec{\theta}) = \prod_{w \in V} \theta_{w}^{n_{w}}\]

Note here that the product loops over the vocabulary, not over the sampled words in the sentence! All we really need to calculate the probability of a sentence is the count of each word in it. Why is this?
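To make the count-based form concrete, here is a sketch of a scorer that loops over vocabulary types rather than over the words of the sentence (the name score-BOW-sentence-by-counts is ours; it assumes the definitions above and uses Clojure’s built-in frequencies and Math/pow). It should agree with score-BOW-sentence up to floating-point rounding.

(defn score-BOW-sentence-by-counts [sentence]
  ;; n_w for each word type w in the sentence.
  (let [counts (frequencies sentence)]
    (list-foldr
     (fn [word rest-score]
       ;; Raise each word's probability to the power of its count.
       (* (Math/pow (score-categorical word vocabulary probabilities)
                    (get counts word 0))
          rest-score))
     1
     vocabulary)))

(score-BOW-sentence-by-counts '(call me Ishmael))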

Evaluating the Bag of Words Model: Likelihoods

Equipped with a scoring function, like the one above, we can evaluate the probability of a dataset under our model. As we mentioned in Distributions over Languages, the score of a data set under some model is often called the likelihood of the data. Technically, the term likelihood refers to a measure of goodness of fit of different models, when the dataset is considered to be fixed, but the models are allowed to vary. That is, likelihood is technically a function of the model (parameters) given the data. Thus, purists will often refer to the likelihood of the model or the likelihood of the parameters rather than the likelihood of the data. But the latter phrase is commonly used.

Let’s set up a toy corpus under our bag-of-words model with a (uniform) categorical distribution as the generator. Just to make it interesting, let’s use a “corpus” with two sentences and a significantly larger vocabulary.

(def my-corpus
     '((Call me Ishmael)
       (Some years ago - never mind how long precisely
        - having little or no money in my purse ,
        and nothing particular to interest me on shore
        , I thought I would sail about a little and
        see the watery part of the world)))

(def vocabulary
     '(Call me Ishmael Some years ago - never mind
       how long precisely having little or no money 
       in my purse and nothing particular to interest 
       on shore I thought would sail about a see the 
       watery part of world))

(def probabilities
     (repeatedly
      (count vocabulary)
      (fn [] (/ 1 (count vocabulary)))))

(defn score-BOW-sentence [sentence]
  (list-foldr
   (fn [word rest-score]
     (* (score-categorical word vocabulary probabilities)
        rest-score))
   1
   sentence))

;; With these set up we can score the corpus under our model.

(defn score-corpus [corpus]
  (list-foldr
   (fn [sen rest-score]
     (* (score-BOW-sentence sen) rest-score))
   1
   corpus))

(score-corpus my-corpus)

What did we do here? Each sentence in a BOW model is the result of sampling each word independently from the categorical distribution with parameters \(\vec{\theta}\) (uniform in this case), and each sentence is sampled independently of the other sentences. Thus, the probability of the whole corpus \(\mathbf{C}\) is given by:

\[\Pr(\mathbf{C} \mid \vec{\theta}) = \prod_{s\in \mathbf{C}} \prod_{w^{(i)} \in s} \theta_{w^{(i)}}\]

Again, since multiplication is associative and commutative, if we know the total count \(n_w\) of each word \(w\) in the corpus as a whole, we can rewrite this expression using word counts.

\[\Pr(\mathbf{C} \mid \vec{\theta}) = \prod_{w \in V} \theta_{w}^{n_{w}}\]

That is, it suffices to know the total number of times each word was used in the corpus in order to calculate the probability of the corpus. This last example also illustrates why we usually work in log probabilities: even with this relatively small corpus and vocabulary, the overall probability of the corpus gets very small very quickly.
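To get a sense of just how quickly, here is a small sketch (the helper name is ours; it uses the vocabulary defined above): under the uniform distribution, every additional word multiplies the probability by \(1/|V|\), so a sequence of \(n\) words has probability \((1/|V|)^n\).

(defn uniform-sequence-probability [n]
  ;; Probability of any particular n-word sequence under the uniform BOW model.
  (Math/pow (/ 1 (count vocabulary)) n))

(uniform-sequence-probability 3)
(uniform-sequence-probability 50)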

Log Likelihoods

As we discussed in the last unit, we often make use of log probabilities in order to avoid underflow errors when probabilities become extremely small, as above. Let’s score the corpus in the last example with log probabilities instead of probabilities.

(defn log-score-BOW-sentence [sen]
  (list-foldr
   (fn [word rest-score]
     (+ (Math/log2 (score-categorical word vocabulary probabilities))
        rest-score))
   0
   sen))

(defn log-score-corpus [corpus]
  (list-foldr
   (fn [sen rest-score]
     (+ (log-score-BOW-sentence sen) rest-score))
   0
   corpus))

(log-score-corpus my-corpus)    
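As a quick sanity check (a sketch assuming the definitions above and the same Math/log2 used in the scorer), the log score should equal the base-2 logarithm of the probability we computed earlier, up to floating-point error; for much larger corpora the direct product would underflow to zero while the log version would not.

;; The log of the product equals the sum of the logs, so these two
;; values should be (approximately) the same:
(Math/log2 (score-corpus my-corpus))
(log-score-corpus my-corpus)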

As you can see, the log score is a number of much more manageable magnitude. We now have a way to evaluate our model on our corpus numerically. This number is called the log likelihood of our model given our data. It measures the goodness of fit of the model on this dataset and is a very important quantity in statistics! We will use the notation \(\mathcal{L}(~\cdot~;\mathbf{C})\) for the log likelihood function. This function gives the log score of a model for the corpus. For a bag of words model represented by a probability vector \(\vec{\theta}\), it can be expressed as follows.

\[\mathcal{L}(\vec{\theta}; \mathbf{C}) = \sum_{s\in \mathbf{C}} \sum_{w^{(i)} \in s} \log[\theta_{w^{(i)}}]\]

Or in count form as:

\[\mathcal{L}(\vec{\theta}; \mathbf{C}) = \sum_{w \in V} n_{w}\log[\theta_{w}]\]
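To make the count form concrete, here is a sketch of a count-based scorer for the whole corpus (the name log-likelihood-by-counts is ours; it assumes the definitions above and uses Clojure’s built-in frequencies and concat, along with the same Math/log2 as before). It should agree with log-score-corpus up to floating-point error.

(defn log-likelihood-by-counts [corpus]
  ;; Corpus-wide counts n_w, computed over all sentences at once.
  (let [counts (frequencies (apply concat corpus))]
    (list-foldr
     (fn [word rest-score]
       ;; Each vocabulary word contributes n_w * log2(theta_w).
       (+ (* (get counts word 0)
             (Math/log2 (score-categorical word vocabulary probabilities)))
          rest-score))
     0
     vocabulary)))

(log-likelihood-by-counts my-corpus)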
