We have now seen how we can use formal languages, or sets of strings, as a model of natural language structure. We have also introduced the idea of describing natural languages in terms of computational processes such as recognizers and generators. With these tools in hand, we now have a framework in which we can plausibly begin to build some models of natural language which are computationally precise.

Before going further, it is useful to consider one more extension of our formal language theoretic model. In our basic model so far, a string is either in a formal language \(L\) or outside of it, that is, in \((V^* - L)\), but we have no measure distinguishing how prototypical, frequent, or common a string is within a language. It is useful, for these reasons and others, to introduce the idea of a probability distribution over a formal language, or a probabilistic formal language.

A probabilistic formal language is just a formal language where each string \(s\) is associated with a probability \(p(s)\). The function \(p(s)\) satisfies two constraints. First, every probability is a number between \(0\) and \(1\), that is, \(p(s) \in [0,1]\). Second, summing \(p(s)\) over all elements in \(L\) gives \(1\), that is, \(\sum_{s \in L} p(s) = 1\). We refer to the function \(p\) that returns the probability of each string as a probability mass function—this is the term used for such functions when defined on discrete sets. We will talk more about probability mass functions later in the book when we introduce random variables.
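As a minimal sketch (the map p-example and its strings are our own illustration, not part of the text), we can represent such a probability mass function for a small two-string language as a map from strings (lists of symbols) to probabilities, and check the two constraints directly.

;; p-example: a probability mass function over a two-string language,
;; represented as a map from strings to probabilities.
(def p-example {'(a) 0.25, '(b b) 0.75})

;; First constraint: every probability lies in the interval [0,1].
(every? (fn [pr] (<= 0 pr 1)) (vals p-example))

;; Second constraint: the probabilities sum to 1 over the language.
(== 1 (reduce + (vals p-example)))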

If our language \(L\) consisted of two strings, say \(\{a, abba\}\), we could specify a probability distribution by specifying a probability for each string, such that all of the probabilities add up to \(1\). For example, we could specify that \(p(a)=0.5\) and \(p(abba)=0.5\). We can picture this with a barplot.

(barplot '(a abba) '(0.5 0.5))

Alternatively we could have specified that \(p(a)=0.7\) and \(p(abba)=0.3\) giving the following barplot.

(barplot '(a abba) '(0.7 0.3))

There are, of course, an uncountably infinite number of different probability distributions which we can specify on these two strings, each of which is parameterized by a probability vector. A probability vector is just a list of numbers where all of the entries are between zero and one and all of the entries add up to one. We will write probability vectors as \(\vec{\theta}=\langle 0.7, 0.3\rangle\).

As we saw in the last chapter, we can characterize a non-probabilistic language using a recognizer, a predicate which returns true for the strings in the language and false otherwise. A probabilistic language generalizes this idea: it is convenient to think of strings that are not in the language (that is, strings in \(V^*-L\)) as having probability \(0\), while strings in the language can have different probabilities which measure their prototypicality, commonness, or frequency.

Why are probabilistic languages useful? By adding probabilities to our formal languages, we gain several benefits.

We will look at each of these aspects of probabilistic languages in this course. First, let’s talk a bit more about specifying probabilistic formal languages.

Computational Models of Formal Languages

When a language is finite, we can specify its distribution simply by listing one probability for each string. This is similar to the extensional definition of a set.

\[\begin{array}{lc}
s & p(s)\\
\hline
a & 0.6\\
baab & 0.1\\
ab & 0.3
\end{array}\]

A fixed-size discrete probability distribution can be parameterized with a probability vector.
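For instance, pairing the strings in the table above with the probability vector \(\vec{\theta}=\langle 0.6, 0.1, 0.3\rangle\), and assuming the same barplot helper used earlier applies here, we could visualize this distribution as follows.

(barplot '(a baab ab) '(0.6 0.1 0.3))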

What about when our language is infinite, like in the case of \((ab)^*\) or English, for that matter? In the preceding lecture, we discussed computer programs that could characterize formal languages intensionally using recursion. We can also do similar things with probabilistic formal languages.

Samplers

One way of specifying the probability of some corpus is by defining a probabilistic generative model. This is simply a generator whose behavior is random; that is, it is a model which samples strings from some probability distribution. Recall our definition of a generator for the language \((ab)^*\).

;; generate-ab*: build the string (ab)^n as a list of symbols,
;; e.g. (a b a b) for n = 2.
(defn generate-ab* [n]
  (if (= n 0)
    '()
    (concat '(a b) (generate-ab* (- n 1)))))

(generate-ab* 10)

Recall that in this code we left \(n\) unspecified, a nondeterministic choice. One way to define a probability distribution is to replace nondeterministic choice with random choice. How can we do this? First, let’s define a procedure called flip which flips a fair coin.

;; flip: a thunk which simulates a fair coin toss, returning true (heads)
;; with probability 0.5 and false (tails) otherwise.
(defn flip []
  (if (< (rand) 0.5)
    true
    false))
(flip)

flip defines the distribution of a fair coin. We can look at its behavior by examining a histogram of many tosses of this coin. Note that the function flip takes no inputs. Functions like this have a special name in functional programming: they are called thunks.

(hist (repeatedly 100000 flip))
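As another illustration of a thunk (this example is our own, not from the text), here is a zero-argument procedure which returns a uniformly random symbol each time it is called.

;; sample-letter: a thunk which samples uniformly from a small set of symbols.
(defn sample-letter []
  (rand-nth '(a b c)))

(sample-letter)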

Using flip we can define a function that samples from the non-negative integers like so.

;; sample-n: flip a fair coin repeatedly and return the number of tails
;; seen before the first heads.
(defn sample-n []
  (if (flip)
    0
    (+ 1 (sample-n))))

(sample-n)

It is also useful to visualize this distribution using a histogram, to get a feel for how it spreads its probability mass over the natural numbers \(\mathbb{N}\).

(hist (repeatedly 10000 sample-n))

Using sample-n we can now draw random samples from the language \((ab)^*\).

(defn sample-ab* [] (generate-ab* (sample-n)))
(sample-ab*)

We can also use the same idea directly to define sample-ab* by replacing the stopping condition with a random coin flip.

;; sample-ab*: flip a coin at each step; on heads stop and return the
;; empty string, otherwise prepend (a b) and recurse.
(defn sample-ab* []
  (if (flip)
    '()
    (concat '(a b) (sample-ab*))))

(sample-ab*)

Note that we have been using names that begin with sample-. Such a probabilistic generator is known as a sampler and we will often use this naming convention when we write them. Let’s look at the distribution over strings generated by the procedure sample-ab*.

(hist (repeatedly 10000 sample-ab*))

Let’s take a moment to consider the generative process which defines the distribution over this formal language. The generative process can be thought of as walking along the strings of the language, arranged by length (i.e., \(\epsilon\), \(ab\), \(abab\), etc.), and flipping a coin, with probability \(q\) of heads, at each string. If the coin comes up true (heads) at string \((ab)^n\), we stop and return that string. Otherwise, we move on to the string \((ab)^{n+1}\) and repeat. Thus, the probability distribution over strings defined by this process is given by the following expression (recalling that \(1-q\) is the probability of not stopping at a particular \(n\)).

\[\begin{aligned}
p(\epsilon) &= q\\
p(ab) &= q\cdot(1-q)\\
p(abab) &= q\cdot(1-q)^{2}\\
p(ababab) &= q\cdot(1-q)^{3}\\
&\;\;\vdots\\
\text{In general,}\quad p((ab)^n) &= q\cdot(1-q)^{n}\quad\text{for }n\in\{0,1,2,\ldots\}
\end{aligned}\]

This is called a geometric distribution with parameter \(q\). A geometric distribution is a distribution on the natural numbers \(\mathbb{N}\) which is parameterized by a fixed success probability \(q\). In our example, \(q = 0.5\).
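To make the role of the parameter explicit, here is a sketch of a sampler for a geometric distribution with an arbitrary success probability \(q\); the helper flip-weighted and the function sample-geometric are our own names and are not used elsewhere in the text.

;; flip-weighted: a biased coin which returns true with probability q.
(defn flip-weighted [q]
  (< (rand) q))

;; sample-geometric: count the number of failures before the first success
;; of a coin with success probability q.
(defn sample-geometric [q]
  (if (flip-weighted q)
    0
    (+ 1 (sample-geometric q))))

(hist (repeatedly 10000 (fn [] (sample-geometric 0.3))))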

At this point, something may concern you. When we talked about distributions over finite numbers of outcomes above, we discussed the fact that the set of probabilities characterizing the distribution must sum to \(1\). But now we have defined the geometric distribution over \(\mathbb{N}\)—an infinite set! Can this infinite set of probabilities sum to \(1\)?

Recall that a series of the form

\[a + ar + ar^2 + ar^3 + ...\]

is called a geometric series. When \(|r|<1\), the sum of such a geometric series is given by the following expression:

\[a + ar + ar^2 + ar^3 + ... = \sum_{k=0}^\infty ar^k = \frac{a}{1-r}, \quad \text{for } |r|<1.\]

For the geometric distribution, we set \(a=q\) and \(r=(1-q)\); the terms of this sum are then the probabilities of the outcomes at each natural number \(n\).

\[q + q(1-q) + q(1-q)^2 + q(1-q)^3 + ... = \sum_{k=0}^\infty q(1-q)^k\]

So, the sum of this series is

\[\frac{q}{1-(1-q)} = \frac{q}{q} = 1\]

Thus, the total probability over all strings in \((ab)^*\) is \(1\), and we have defined a well-formed probability distribution over this infinite language.
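As a quick numerical sanity check (our own addition, not from the text), we can sum the first fifty terms of this series for \(q=0.5\) and see that the total is already essentially \(1\).

;; Sum q*(1-q)^n for n = 0,...,49 with q = 0.5. The exact value is
;; 1 - 0.5^50, which is indistinguishable from 1 in floating point.
(let [q 0.5]
  (reduce + (map (fn [n] (* q (Math/pow (- 1 q) n))) (range 50))))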

Scorers

We can also write the probabilistic equivalent of a recognizer for this language. Recall our original recognizer function.

;; in-ab*?: return true if str consists of zero or more repetitions of
;; the prefix (a b), i.e., if str is in the language (ab)*.
(defn in-ab*? [str]
  (if (empty? str)
    true
    (if (prefix? '(a b) str)
      (in-ab*? (rest (rest str)))
      false)))

(in-ab*? '(a b a b))

What is the probabilistic equivalent of a recognizer? It is a function that returns not just whether a string is in a language or not, but the probability of the string under the generative model. We call such functions scoring functions.

;; score-ab*: return the probability of str under the generative model
;; above, or 0 if str is not in the language (ab)*.
(defn score-ab* [str]
  (if (empty? str)
    0.5
    (if (prefix? '(a b) str)
      (* 0.5 (score-ab* (rest (rest str))))
      0)))

(score-ab* '(a b a b))

We will often use the naming convention score- for scoring functions. Note that what this function does is score a string under the generative model above, which means that it corresponds to one of the possible probability mass functions \(p(s)\) for assigning probabilities to the language \((ab)^*\). Here the motivation for writing probabilities in terms of a probability mass function becomes clearer: we could not have listed the infinitely many probabilities of the strings in our language one by one.
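As a rough empirical illustration (this check is our own and not part of the text), the relative frequency with which sample-ab* produces a given string should be close to the probability that score-ab* assigns to it.

;; Estimate the probability of (a b a b) by sampling and compare it to the
;; score assigned by score-ab*, which is 0.5 * 0.5 * 0.5 = 0.125.
(let [target '(a b a b)
      samples (repeatedly 10000 sample-ab*)
      estimate (/ (count (filter (fn [s] (= s target)) samples)) 10000.0)]
  [estimate (score-ab* target)])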

What would the scorer look like if we wanted to exclude the empty string \(\epsilon\) from the language?

