8 Distributions over Languages
;; Helper used later in this chapter: is the list pr a prefix of the list str?
(defn prefix? [pr str]
  (if (> (count pr) (count str))
    false
    (if (empty? pr)
      true
      (if (= (first pr) (first str))
        (prefix? (rest pr) (rest str))
        false))))
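For example, the following calls (our own additions, following the chapter's convention of showing a call after each definition) illustrate its behavior:
(prefix? '(a b) '(a b a b))  ;; => true
(prefix? '(a b) '(a c))      ;; => false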
We have now seen how we can use formal languages, or sets of strings, as a model of natural language structure. We have also introduced the idea of describing natural languages in terms of computational processes such as recognizers and generators. With these tools in hand, we now have a framework in which we can plausibly begin to build some models of natural language which are computationally precise.
Before going further, it is useful to consider one more extension of our formal language theoretic model. In our basic model so far, a string is either in a formal language \(L\) or outside of the formal language \(L\), that is, in \((V^* - L)\), but we have no measure distinguishing how prototypical, frequent, or common a string is within a language. It is useful, for these reasons and others, to introduce the idea of a probability distribution over a formal language, or a probabilistic formal language.
A probabilistic formal language is just a formal language where each string \(s\) is associated with a probability \(p(s)\). The function \(p(s)\) satisfies two constraints. First, every probability is a number between \(0\) and \(1\), that is, \(p(s) \in [0,1]\). Second, summing \(p(s)\) over all elements in \(L\) gives \(1\), that is, \(\sum_{s \in L} p(s) = 1\). We refer to the function \(p\) that returns the probability of each string as a probability mass function—this is the term used for such functions when defined on discrete sets. We will talk more about probability mass functions later in the book when we introduce random variables.
If our language \(L\) consisted of two strings, say \(\{a, abba\}\), we could specify a probability distribution by specifying a probability for each string, such that all of the probabilities add up to \(1\). For example, we could specify that \(p(a)=0.5\) and \(p(abba)=0.5\). We can picture this with a barplot.
(barplot '(a abba)
         '(0.5 0.5))
Alternatively we could have specified that \(p(a)=0.7\) and \(p(abba)=0.3\) giving the following barplot.
(barplot '(a abba) '(0.7 0.3))
There are, of course, an uncountably infinite number of different probability distributions which we can specify on these two strings, each of which is parameterized by a probability vector. A probability vector is just a list of numbers where all of the entries are numbers between zero and one and all of the entries add up to one. We will write probability vectors as \(\vec{\theta}=\langle 0.7, 0.3\rangle\).
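As a quick sketch, we can check both conditions on a probability vector directly. The helper name probability-vector? is our own and is not used elsewhere in the chapter.
;; Check that every entry is in [0,1] and that the entries sum to 1,
;; allowing a small tolerance for floating-point rounding.
;; probability-vector? is our own illustrative helper.
(defn probability-vector? [v]
  (and (every? #(and (>= % 0) (<= % 1)) v)
       (< (Math/abs (- 1.0 (reduce + v))) 1e-9)))

(probability-vector? '(0.7 0.3))  ;; => true
(probability-vector? '(0.7 0.6))  ;; => false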
As we saw in the last chapter, we can characterize a non-probabilistic language using a recognizer: a predicate which returns true for the strings in the language and false otherwise.
When we think of a probability distribution over strings, it is also
convenient to think about the strings that are not in the language
(that is, strings in \(V^*-L\)) as having probability \(0\). A probabilistic
language generalizes this so that strings outside the language have
\(0\) probability but strings in the language can have different
probabilities which measure their prototypicality, commonness, or
frequency.
Why are probabilistic languages useful? By adding probabilities to our formal languages, we gain several benefits.
- We are able to model aspects of real-world datasets that aren’t necessarily directly related to well-formedness, such as frequency or prototypicality.
- We also gain a related notion of goodness of fit of a probabilistic language to a real world dataset. We can use probability to measure how well a model explains some real data.
- Combined with other principles of inductive inference, such as the rules of probabilistic inference, we can derive ways of selecting good models. These methods can be used to model language learning and language processing.
We will look at each of these aspects of probabilistic languages in this course. First, let’s talk a bit more about specifying probabilistic formal languages.
Computational Models of Formal Languages
When a language is finite, we can specify its distribution simply by listing one probability for each string. This is similar to the extensional definition of a set.
| \(s\) | \(p(s)\) |
|---|---|
| a | 0.6 |
| baab | 0.1 |
| ab | 0.3 |
A fixed-size discrete probability distribution can be parameterized with a probability vector.
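Concretely, the table above could be represented as a Clojure map, with a lookup that returns \(0\) for any string outside the language. The names finite-dist and score-finite below are our own, not part of the chapter's code.
;; The distribution from the table, keyed by strings represented as
;; lists of symbols. finite-dist and score-finite are our own names.
(def finite-dist
  {'(a)       0.6
   '(b a a b) 0.1
   '(a b)     0.3})

(defn score-finite [s]
  (get finite-dist s 0))  ;; default to 0 for strings outside the language

(score-finite '(b a a b))  ;; => 0.1
(score-finite '(b b))      ;; => 0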
What about when our language is infinite, like in the case of \((ab)^*\) or English, for that matter? In the preceding lecture, we discussed computer programs that could characterize formal languages intensionally using recursion. We can also do similar things with probabilistic formal languages.
Samplers
One way of specifying the probability of some corpus is by defining a probabilistic generative model. This is simply a generator whose behavior is random; that is, it is a model which samples strings from some probability distribution. Recall our definition of a generator for the language \((ab)^*\).
(defn generate-ab* [n]
  (if (= n 0)
    '()
    (concat
     '(a b)
     (generate-ab* (- n 1)))))
(generate-ab* 10)
Recall that in this code we left \(n\) unspecified, a
nondeterministic choice. One way to define a probability
distribution is to replace nondeterministic choice with random
choice. How can we do this? First, let’s define a procedure called
flip
which flips a fair coin.
(defn flip
  []
  (if (< (rand) 0.5)
    true
    false))
(flip)
flip defines a distribution over a fair coin. We can look at its behavior by examining a histogram of many tosses of this coin. Note that the function flip takes no inputs. Functions like this have a special name in functional programming: they are called thunks.
(hist (repeatedly 100000 flip))
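The coin above is fair, but nothing forces that choice. A biased coin with an arbitrary probability of heads is a small variation; the sketch below uses our own name flip-p, which is not used elsewhere in the chapter, and the same idea underlies the parameter \(q\) discussed later in this section.
;; A coin that comes up true with probability p. flip-p is our own
;; illustrative name and is not part of the chapter's code.
(defn flip-p [p]
  (< (rand) p))

(hist (repeatedly 100000 #(flip-p 0.3)))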
Using flip
we can define a function that samples from the
non-negative integers like so.
(defn sample-n []
  (if (flip)
    0
    (+ 1 (sample-n))))
(sample-n)
It is also useful to visualize this distribution using a histogram, to get a feel for how it spreads its probability mass over the natural numbers \(\mathbb{N}\).
(hist (repeatedly 10000 sample-n))
Using sample-n
we can now draw random samples from the
language \((ab)^*\).
(defn sample-ab* [] (generate-ab* (sample-n)))
(sample-ab*)
We can also use the same idea directly to define sample-ab*
by replacing the stopping condition with a random coin flip.
(defn sample-ab* []
  (if (flip)
    '()
    (concat '(a b) (sample-ab*))))
(sample-ab*)
Note that we have been using names that begin with sample-. Such a probabilistic generator is known as a sampler, and we will often use this naming convention when we write them. Let’s look at the distribution over strings generated by the procedure sample-ab*.
(hist (repeatedly 10000 sample-ab*))
Let’s take a moment to consider the generative process which defines the distribution over this formal language. The generative process can be thought of as walking along the strings of the language, arranged in terms of length (i.e., \(\epsilon\), \(ab\), \(abab\), etc.) and flipping a coin, with probability \(q\) of heads, at each string. If the coin comes up true (heads) at string \((ab)^n\) then we stop and return that string. Otherwise, we continue on to string \((ab)^{n+1}\). Thus, the probability distribution over strings defined by this process is given by the following expression (recalling that \(1-q\) is the probability of not stopping at a particular \(n\)).
\[p((ab)^n) = (1-q)^n q\]
This is called a geometric distribution with parameter \(q\). A geometric distribution is a distribution on the natural numbers \(\mathbb{N}\) which is parameterized by a fixed success probability \(q\). In our example, \(q = 0.5\).
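To make the parameter \(q\) explicit, we could write a sampler for a geometric distribution with an arbitrary success probability. The name sample-geometric below is our own; sample-n above corresponds to the special case \(q = 0.5\).
;; A sketch of a geometric sampler: count the number of failed coin
;; flips before the first success, where each flip succeeds with
;; probability q. sample-geometric is our own illustrative name.
(defn sample-geometric [q]
  (if (< (rand) q)
    0
    (+ 1 (sample-geometric q))))

(hist (repeatedly 10000 #(sample-geometric 0.3)))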
At this point, something may concern you. When we talked about distributions over finite numbers of outcomes above, we discussed the fact that the set of probabilities characterizing the distribution must sum to \(1\). But now we have defined the geometric distribution over \(\mathbb{N}\)—an infinite set! Can this infinite set of probabilities sum to \(1\)?
Recall that a series of the form
\[a + ar + ar^2 + ar^3 + ...\]is called a geometric series. When \(|r|<1\), the sum of such a geometric series is given by the following expression:
\[a + ar + ar^2 + ar^3 + ... = \sum_{k=0}^\infty ar^k = \frac{a}{1-r}, \quad \text{for } |r|<1.\]In geometric distributions, we set \(a=q\) and \(r=(1-q)\) and the terms of this sum represent the probabilities of the outcomes at each natural number \(n\).
\[q + q(1-q) + q(1-q)^2 + q(1-q)^3 + ... = \sum_{k=0}^\infty q(1-q)^k\]So, the sum of this series is
\[\frac{q}{1-(1-q)} = \frac{q}{q} = 1\]Thus the total probability over all strings in \((ab)^*\) sums to \(1\), and we have defined a well-formed probability distribution over this infinite language.
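We can also check this numerically: the partial sum of the first fifty terms of the series, with \(q=0.5\), is already essentially \(1\) (a quick sketch, not part of the chapter's code).
;; Partial sum of q(1-q)^k for k = 0, ..., 49 with q = 0.5; the result
;; is 1 - 0.5^50, which differs from 1 by less than 1e-15.
(let [q 0.5]
  (reduce + (map (fn [k] (* q (Math/pow (- 1 q) k))) (range 50))))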
Scorers
We can also write the probabilistic equivalent of a recognizer for this language. Recall our original recognizer function.
(defn in-ab*? [str]
  (if (empty? str)
    true
    (if (prefix? '(a b) str)
      (in-ab*? (rest (rest str)))
      false)))
(in-ab*? '(a b a b))
What is the probabilistic equivalent of a recognizer? It is a function that returns not just whether a string is in the language or not, but the probability of the string under the generative model. We call such functions scoring functions.
(defn score-ab* [str]
  (if (empty? str)
    0.5
    (if (prefix? '(a b) str)
      (* 0.5 (score-ab* (rest (rest str))))
      0)))
(score-ab* '(a b a b))
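A few more calls (our own examples) show how the scorer behaves on the boundary cases:
(score-ab* '())     ;; => 0.5, the probability of stopping immediately
(score-ab* '(a b))  ;; => 0.25, that is, 0.5 * 0.5
(score-ab* '(a a))  ;; => 0, since (a a) is not in (ab)*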
We will often use this naming convention score- for scoring functions. Note that what this function does is score a string under the generative model above, which means that it corresponds to one of the possible probability mass functions \(p(s)\) for assigning probabilities to the language \((ab)^*\). In this case, the motivation for writing probabilities in terms of a probability mass function becomes clearer, since we couldn’t explicitly list the infinitely many probabilities in our language.
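One way to see this correspondence concretely is to compare the score of a string with the empirical frequency with which the sampler produces it. The helper check-score below is our own sketch, not part of the chapter's code.
;; Compare score-ab* for a string with its relative frequency among n
;; draws from sample-ab*. check-score is our own illustrative helper.
(defn check-score [s n]
  (let [samples (repeatedly n sample-ab*)
        freq    (/ (count (filter #(= % s) samples)) (double n))]
    {:score (score-ab* s) :empirical freq}))

(check-score '(a b) 10000)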
What would the scorer look like if we wanted to exclude the empty string \(\epsilon\) from the language?