With our definitions of probabilistic samplers and scorers in place, we now return to the advantages of having a notion of a probabilistic formal language. The most important of these is that probabilistic languages give us a better way to connect our abstract models with real-world data.

Real-world natural language datasets like the text of a book, the set of sentences heard by a child learning language, or the set of English tweets differ from idealized formal languages in many ways. A formal language contains each string in the language exactly once. Obviously, no real-world sample of English contains all and only the sentences of English exactly once! There are many reasons that a particular sentence might or might not appear in a real-world dataset: the meanings that a speaker or writer wishes to express, the style that they wish to use, the people they are communicating with, mistakes made during speaking or writing, and many others. Note that the vast majority of these reasons go well beyond whether a sentence is well-formed or not. Real-world datasets are also not unboundedly large. As we saw earlier, English contains an unbounded number of natural language sentences. Any actual sample of natural language sentences must be finite, and thus leave out almost all well-formed English sentences. Again, the reasons that particular sentences are left out may have very little to do with well-formedness. Real samples of natural language will also sometimes contain repeated instances of the same sentence (though less often than our intuitions suggest). That is, the sentences we observe in a real-world dataset come in different frequencies.

Many arguments in linguistics abstract away from such factors of linguistic variation and usage. It is important to have tools for connecting abstract theories with data while accounting for unexplained variation in a principled way. This is especially true when we want to build models of language learning and processing. Critically, as we will see below, adding probabilities to our formal languages will allow us to measure goodness of fit of our models to real world datasets.

Corpora

To see how this works, we will need to formalize the notion of a real-world dataset. To do this we will introduce the (formal) notion of a corpus. A corpus \(\mathbf{C}\) is a finite multiset of strings. A multiset is the set-theoretic name for a collection that is allowed to contain multiple, distinguished copies of the same element. For example, the set of sentences in Moby Dick is a multiset. The first few strings in this multiset are:

Call me Ishmael .

Some years ago — never mind how long precisely — having little or no money in my purse , and nothing particular to interest me on shore , I thought I would sail about a little and see the watery part of the world .

It is a way I have of driving off the spleen and regulating the circulation .

Whenever I find myself growing grim about the mouth ; whenever it is a damp , drizzly November in my soul ; whenever I find myself involuntarily pausing before coffin warehouses , and bringing up the rear of every funeral I meet ; and especially whenever my hypos get such an upper hand of me , that it requires a strong moral principle to prevent me from deliberately stepping into the street , and methodically knocking people’s hats off—then , I account it high time to get to sea as soon as I can .

This is my substitute for pistol and ball .

With a philosophical flourish Cato throws himself upon his sword ; I quietly take to the ship .

There is nothing surprising in this .

If they but knew it , almost all men in their degree , some time or other , cherish very nearly the same feelings towards the ocean with me .

Note that here we have tokenized this text. That is, we have put spaces between things like words and punctuation and other symbols that we would like to distinguish from one another in a formal model of this text.
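A minimal tokenizer along these lines can be sketched as follows. This is an illustrative assumption, not the tokenizer actually used to prepare the text above: it splits on whitespace and peels off a single trailing punctuation mark, while real tokenizers handle many more cases.

```clojure
(require '[clojure.string :as string])

;; A minimal tokenizer sketch: split a sentence on whitespace and, as
;; a simple approximation, separate a trailing punctuation mark into
;; its own token. Real tokenizers are considerably more sophisticated.
(defn tokenize [sentence]
  (mapcat (fn [word]
            (if (re-find #"[.,;!?]$" word)
              [(subs word 0 (dec (count word)))
               (subs word (dec (count word)))]
              [word]))
          (string/split sentence #"\s+")))

(tokenize "Call me Ishmael.")
;; => ("Call" "me" "Ishmael" ".")
```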

Goodness of Fit of a Model to a Corpus

A natural way of defining a goodness-of-fit score for a model is to ask what probability it assigns to a corpus, a quantity also sometimes called the likelihood of the model given the data. (People also commonly say "likelihood of the data," though this is technically an incorrect usage.)

To compute the probability of a corpus with respect to a distribution given by some model, we will assume that each sentence in our corpus was independently sampled or drawn from our probabilistic formal language. If we further assume that each sentence is drawn from the same probabilistic formal language, that is, that the distribution over draws is identical, then we say that our dataset was drawn i.i.d. This abbreviation, which stands for independent and identically distributed, is very commonly used in statistics and probability theory.

Corpus Samplers

The assumption that our corpus is generated i.i.d. is embodied by the sampler for a corpus, which can be defined as follows.

(defn sample-corpus [generator size]
  (if (= size 0)
    '()
    (cons (generator)
          (sample-corpus generator (- size 1)))))

(sample-corpus sample-ab* 4)
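For readers joining here: sample-ab* was defined in an earlier section. A definition consistent with the scores used below might look like the following sketch (this reconstruction is an assumption; check it against the earlier chapter).

```clojure
;; Possible definition of sample-ab*: flip a fair coin; on heads stop,
;; on tails emit "a b" and recurse. Under this definition the string
;; (ab)^k receives probability 0.5^(k+1). `flip` is assumed to be the
;; fair-coin sampler introduced earlier in the text.
(defn sample-ab* []
  (if (flip)
    '()
    (concat '(a b) (sample-ab*))))
```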

Corpus Scorers

According to the laws of probability theory, the probability of multiple independent events all occurring is the product of their individual probabilities. Thus, the probability of an i.i.d. corpus is given by the product of the probabilities of the strings which appear in it, under the distribution of the formal language.
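In symbols, if the corpus is \(\mathbf{C} = \langle s^{(1)}, \dots, s^{(n)} \rangle\) and our probabilistic formal language assigns probability \(\Pr(s)\) to each string \(s\), then

\[ \Pr(\mathbf{C}) = \prod_{i=1}^{n} \Pr(s^{(i)}) \]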

We can therefore define a scorer for corpora as follows.

(def my-corpus
  (list '(a b a b)
        '()
        '(a b)
        '()
        '(a b)
        '(a b a b a b)
        '(a b)
        '(a b a b a b a b)))

(defn score-corpus [scorer corpus]
  (if (empty? corpus)
    1.0
    (* (scorer (first corpus))
       (score-corpus scorer (rest corpus)))))

(score-corpus score-ab* my-corpus)

There is a close relationship between samplers and scorers. In particular, any corpus which is sampled by a generator can be scored by a scorer. The score of that corpus is simply the probability that the generator would have generated it in the first place under random chance. This connection is important for understanding how probabilities can be used for learning and inference, as we will see in upcoming lectures.
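One way to see this connection concretely is to compare a scorer's exact output with the empirical frequency of a string across many samples from the corresponding generator. A rough sketch (the sample count here is an arbitrary choice):

```clojure
;; Estimate the probability of a string empirically by drawing many
;; samples from a generator and counting how often the target appears.
(defn empirical-probability [generator target n]
  (/ (count (filter #(= % target)
                    (repeatedly n generator)))
     (double n)))

;; With enough samples, the estimate should approach the exact score
;; that the matching scorer assigns:
(empirical-probability sample-ab* '(a b) 10000)
(score-ab* '(a b))
```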

Fit of a Language to a Corpus

With language samplers and scorers defined, we are now in a position to evaluate the probability or score of a dataset, that is, the likelihood of the model. You can think about this in two ways. First, as a measure of how probable the data was to have been sampled from the model. Second, as a measure of how good the model is at explaining a corpus of observations: if it assigns high probability to them by matching empirical corpus frequencies well, it may be a good model of the domain.

Imagine that we had two different probabilistic formal languages represented as computer programs. How can we tell which model better explains some corpus data?

To study this question, let’s define a sampler and a scorer for a new formal language \(\{a,b\}^*\).

(defn sample-kleene-ab []
  (if (flip)
    '()
    (cons (if (flip) 'a 'b)
          (sample-kleene-ab))))

(sample-kleene-ab)

Here we use a recursion similar to our definition of sample-ab*, except that on each pass through the recursive alternative of the conditional we randomly sample either a or b to cons onto the resulting string.

What does this language look like?

(hist (repeatedly 1000 sample-kleene-ab))

We can also implement a scoring function for this language.

(defn score-kleene-ab [str]
  (if (empty? str)
    0.5
    (* 0.5
       (cond 
         (= (first str) 'a) 0.5
         (= (first str) 'b) 0.5
         :else 0)
       (score-kleene-ab (rest str)))))

[['epsilon (score-kleene-ab '())]
 ['a       (score-kleene-ab '(a))]
 ['ab      (score-kleene-ab '(a b))]
 ['ababab  (score-kleene-ab '(a b a b a b))]]

Now, let’s compare the scores of various strings in \((ab)^*\) and \(\{a,b\}^*\) under these two probabilistic formal languages.

[["string" "score in (ab)*"     "score in {a,b}*"]
 ['epsilon (score-ab* '())      (score-kleene-ab '())]
 ['ab      (score-ab* '(a b))   (score-kleene-ab '(a b))]
 ['ababab  (score-ab* '(a b a b a b))  (score-kleene-ab '(a b a b a b))]
 ['bb      (score-ab* '(b b))   (score-kleene-ab '(b b))]
 ['abb     (score-ab* '(a b b)) (score-kleene-ab '(a b b))]]

What we can see here is that the probabilistic formal language defined on \(\{a,b\}^*\) assigns probability mass to all the strings in \((ab)^*\), but also assigns mass to strings not in \((ab)^*\). As a consequence, it assigns less probability to these strings in aggregate. This is sometimes called the law of conservation of belief and is a direct consequence of the fact that probability distributions must sum to \(1\).
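To make this concrete: score-kleene-ab pays a factor of \(0.5\) for each continue/stop decision and \(0.5\) for each symbol choice, so the string \(ab\) costs five factors of \(0.5\). Assuming, consistent with the comparison above, that the \((ab)^*\) language assigns \((ab)^k\) probability \(0.5^{k+1}\), we get

\[ \Pr_{(ab)^*}(ab) = 0.5^2 = 0.25, \qquad \Pr_{\{a,b\}^*}(ab) = 0.5^5 = 0.03125 \]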

As a result, when we compare the scores that each model assigns to the corpus below, which is drawn from the language \((ab)^*\), the probabilistic language over \((ab)^*\) will assign a higher likelihood to the corpus.

(def my-corpus
  (list '(a b a b)
        '()
        '(a b)
        '()
        '(a b)
        '(a b a b a b)
        '(a b)
        '(a b a b a b a b)))

[["likelihood of score-(ab)*"  (score-corpus score-ab* my-corpus)]
 ["likelihood of score-{a,b}*" (score-corpus score-kleene-ab my-corpus)]]

We can conclude that as a model of this dataset the probabilistic formal language on \((ab)^*\) is better than the one on \(\{a,b\}^*\). This captures an intuitive notion of goodness of fit of the language to the corpus. In general, using probabilities to measure goodness of fit will automatically penalize less restrictive models since they must, in general, assign probability mass to a larger number of outcomes and, thus, assign the average dataset lower probability. This principle is sometimes called Bayes’ Occam’s Razor and is an important reason to use probability theory for modeling.
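A practical note: for large corpora, the product of many small probabilities quickly underflows floating-point arithmetic. A common remedy, sketched here as a variant of score-corpus above, is to sum log probabilities instead:

```clojure
;; Score a corpus in log space: summing log probabilities avoids the
;; numerical underflow that multiplying many small numbers produces.
(defn log-score-corpus [scorer corpus]
  (if (empty? corpus)
    0.0
    (+ (Math/log (scorer (first corpus)))
       (log-score-corpus scorer (rest corpus)))))

;; (Math/exp (log-score-corpus score-ab* my-corpus)) recovers the
;; ordinary likelihood whenever it is large enough to represent.
```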

