As we emphasized in Classes of Formal Languages, our primary scientific goal is to give a mathematical characterization of universal grammar modeled—for the time being—as a class of formal languages. Now that we have begun looking at probabilistic formal languages defined using categorical distributions over the vocabulary, we can easily generalize the idea of a class of formal languages to a class of probabilistic formal languages or, more accurately, a class of models of probabilistic languages. The bag-of-words approach to generating and scoring sentence structures that we have set up constitutes a class of models, where each particular model is given by choosing some specific vocabulary and distribution over this vocabulary.

Why should we care about classes of probabilistic models? One reason is that, by fixing such a class, we can begin to define procedures for doing inference over hypotheses, for example, procedures by which learners might choose between models in light of some input data. Such inference procedures can serve as formal models of language learning and language processing. To define them, we will need a way of choosing between models in a class based on some probabilistic criterion.

Comparing Models in the Class of Categorical BOW Models

If we wish to compare how well models do with respect to a particular training data set, one natural way to do so is to use the likelihood of each model given that dataset. In the last chapter, we introduced the class of categorical BOW models. Each model in this class is parameterized, or indexed, by some parameter vector \(\vec{\theta} \in \Theta\), and we wrote the log likelihood of the model indexed by \(\vec{\theta}\) given a corpus \(\mathbf{C}\) as:

\[\mathcal{L}(\vec{\theta}; \mathbf{C})\]

With such a simple class of models, it may be hard to see how one model could be much better or worse than another, since they all have very little structure. But consider the following example. Suppose our corpus consisted of seven copies of the sentence Call me Ishmael.

(def my-corpus 
  '((Call me Ishmael)
    (Call me Ishmael)
    (Call me Ishmael)
    (Call me Ishmael)
    (Call me Ishmael)
    (Call me Ishmael)
    (Call me Ishmael)))

(def vocabulary
  '(Call me Ishmael Some years ago - never mind
         how long precisely having little or no money 
         in my purse and nothing particular to interest 
         on shore I thought would sail about a see the 
         watery part of world))

(def probabilities
  ;; uniform distribution: every word gets probability 1/(count vocabulary)
  (repeatedly
   (count vocabulary)
   (fn [] (/ 1 (count vocabulary)))))
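
The function log-score-corpus was defined in the last chapter. If you are running these snippets in isolation, the following minimal stand-in will do; this is our own sketch rather than the earlier definition, it assumes the score is a base-2 log probability, and it reads whatever vocabulary and probabilities are currently defined.

(defn log2 [x]
  (/ (Math/log x) (Math/log 2)))

(defn log-score-corpus [corpus]
  ;; sum the log probability of every word token in the corpus
  (let [word->prob (zipmap vocabulary probabilities)]
    (reduce +
            (map (fn [w] (log2 (word->prob w)))
                 (apply concat corpus)))))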

(log-score-corpus my-corpus)

Here we have replaced our corpus with many repeats of the same sentence, while holding our vocabulary constant. The probability of each element of our vocabulary is \(\frac{1}{|V|} = \frac{1}{39}\), because we used a uniform distribution over the entire set of words from before. However, our corpus uses only three of these words, several times each. Clearly we can do better by setting the probabilities associated with the words Call, me, and Ishmael higher than the probabilities of the other words.

Let’s see what happens when we reparameterize our categorical distribution to place all of its probability mass on just the three words that appear in the corpus.

(def vocabulary-subset '(Call me Ishmael))

(def vocabulary-subset-size (count vocabulary-subset))

(def probabilities 
  (concat 
   ; uniform on subset
   (repeat vocabulary-subset-size (/ 1 vocabulary-subset-size)) 
   ; zero outside subset
   (repeat (- (count vocabulary) vocabulary-subset-size) 0))) 

(log-score-corpus my-corpus)

As we can see, this model assigns a significantly higher score to the corpus. This is, in fact, a much higher probability: remember that the (base-2) log probability is the negative of the number of times that we must multiply \(\frac{1}{2}\) together to get the probability.
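
To get a feel for the difference (assuming, as in the discussion above, that the score is a base-2 log probability): the corpus contains \(21\) word tokens, so the uniform model assigns it log probability \(21 \log_2 \frac{1}{39} \approx -111.0\), while the restricted model assigns it \(21 \log_2 \frac{1}{3} \approx -33.3\).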

The Principle of Maximum Likelihood

This method of choosing hypotheses is known in statistics as the principle of maximum likelihood (we will see others later). This principle states that we should choose the model in our class which maximizes the likelihood of our data. If our class of models is indexed by some parameter vector \(\vec\theta \in \Theta\), then the principle of maximum likelihood says we should choose the model indexed by the parameters \(\vec{\hat{\theta}}{}^{\mathrm{ML}}\) that maximize the (log) likelihood of the model given the data \(\mathbf{C}\), i.e.,

\[\DeclareMathOperator*{\argmax}{arg\,max} \vec{\hat{\theta}}{}^{\mathrm{ML}} = \argmax_{\vec{\theta} \in \Theta} \mathcal{L}(\vec{\theta};\mathbf{C})\]

Recall that the likelihood of the corpus under our model is:

\[\Pr(\mathbf{C} \mid \vec{\theta}) = \prod_{w \in V} \theta_{w}^{n_{w}}\]

In other words, it is just the product of the probability of each word raised to the power of the number of times that word appears in the corpus. Thus, the log likelihood is as follows.

\[\mathcal{L}(\vec{\theta};\mathbf{C}) = \log \prod_{w \in V} \theta_{w}^{n_{w}} = \sum_{w \in V} n_{w}\log(\theta_{w})\]
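
To make the count-based form concrete, here is a quick check of our own, reusing the hypothetical log2 helper from the stand-in above; it computes \(\sum_{w \in V} n_{w}\log_2(\theta_{w})\) for the current (restricted) model, which should match the token-by-token score of the corpus.

(let [theta  (zipmap vocabulary probabilities)
      counts (frequencies (apply concat my-corpus))]
  ;; sum over word types of count * log2(probability)
  (reduce +
          (map (fn [[w n]] (* n (log2 (theta w))))
               counts)))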

Which probabilities maximize this quantity? It is a bit complicated to show (the usual proof uses Lagrange multipliers), but the probabilities for each word which maximize the log likelihood in the case of a BOW model are

\[\hat{\theta}{}^{\mathrm{ML}}_{w} = \frac{n_{w}}{\sum_{w^\prime \in V} n_{w^\prime}} = \frac{n_{w}}{|\mathbf{C}|}\]

In other words, the optimal probability for each word under the principle of maximum likelihood is simply its renormalized count, that is, its relative frequency in the corpus. Now we can see why placing all of the probability mass on just the three words that appear in our corpus gave us a higher likelihood: the uniform probabilities of \(\frac{1}{3}\) are exactly the renormalized counts \(\frac{7}{21}\), so that model was the maximum likelihood estimator.
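
As a quick check, here is a small sketch of our own (the names word-counts, corpus-size, and ml-probabilities are not from an earlier chapter) that computes these renormalized counts directly from my-corpus.

(def word-counts
  (frequencies (apply concat my-corpus)))

(def corpus-size
  (reduce + (vals word-counts)))

(def ml-probabilities
  (into {} (map (fn [[w n]] [w (/ n corpus-size)]) word-counts)))

;; each of Call, me, and Ishmael occurs 7 times out of 21 tokens,
;; so each maximum likelihood probability is 7/21 = 1/3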

