We have now seen our first probability distributions over strings. We have in fact seen two different representations of these distributions—samplers and scorers. In general, there are many different kinds of probability distributions. In this part of the course, we will focus on discrete distributions, that is, probability distributions whose support consists of discrete elements like integers or symbols. As we have seen, in this setting the score function \(p(s)\) is called a probability mass function. We can think of a discrete probability distribution as a pair \(\langle S, p \rangle\) consisting of a nonempty set \(S\), called the sample space or set of outcomes, together with a probability mass function \(p\) which assigns a probability to each element of \(S\) such that \(p(s) \in [0,1]\) for all \(s\) in \(S\) and \(\sum_{s \in S} p(s) = 1\). The subset of outcomes to which the probability mass function assigns nonzero values is called the support of the distribution. Note that these definitions will not be sufficient if our sample space is continuous (uncountable); in that case we would need to make use of measure theory to give a well-defined notion of probability distribution. However, we won’t consider these complications in this course.
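For example, a fair coin is the discrete distribution \(\langle \{\texttt{true}, \texttt{false}\}, p \rangle\) where \(p(\texttt{true}) = p(\texttt{false}) = \frac{1}{2}\).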

Bernoulli Random Variables

One of the most basic probability distributions is the Bernoulli distribution, which can be thought of as representing a biased coin and is expressed by (flip weight), where weight is the probability of heads (or true). It is useful to introduce some additional terminology and notation at this point. First, a random variable is a mathematical object that takes on a value from some sample space \(S\) according to some probability mass function \(p\). For instance, some particular invocation of flip constitutes a random variable.

(defn flip [weight]
  ;; (rand) returns a uniform random number in [0, 1),
  ;; so this returns true with probability weight
  (if (< (rand) weight)
    true
    false))
(def my-random-variable (flip 0.5))

my-random-variable

A random variable refers to a particular call of flip—and the value returned is the outcome of the random variable. While the procedure flip defines a distribution, the invocation of this procedure in (def my-random-variable (flip 0.5)) defines a particular random variable. In general, we will use the term random variable to refer to the use of a distribution in a particular place in a program, while the term distribution will refer to the abstract set of values together with their probabilities. Thus, in the following code sample, there are three random variables, all with the same distribution (a fair coin).

(list
(flip 0.5)
(flip 0.5)
(flip 0.5))

In general, once we have added random functions (like flip) to our programming language, any expression can be a random variable.
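For example, the following expression, which combines two calls to flip with the deterministic function and, is itself a random variable; its outcome is true with probability \(0.25\).

(and (flip 0.5) (flip 0.5))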

Random variables can thus be more complex objects built from simpler parts, including other random variables.

(def my-random-variable 
  (list 
    (flip 0.5) 
    (flip 0.75) 
    (flip 0.5)))
my-random-variable

Here the random variable of interest is a list created by putting together three invocations of flip (each with a specified weight). Each particular invocation of flip in this list is a random variable as well. We will think of any expression built up of samplers and deterministic (non-random) functions as a random variable, where the outcome or value sampled is the result of evaluating that expression. Most of the objects of interest in this course will be complex random variables built up from simpler ones like this.

Probability Notation

Since probability is a tool used across mathematics, engineering, and the sciences, there is an array of different notations and conventions used in different contexts. Here we introduce some of these.

When \(X\) is a random variable distributed according to some known distribution, such as Bernoulli, we often write the following.

\[X \sim \operatorname{Bernoulli}(\theta)\]

This is read “\(X\) is distributed as a Bernoulli distribution with parameter (i.e., weight) \(\theta\).” It is also sometimes read “\(X\) is sampled from a Bernoulli distribution with parameter \(\theta\).”

Similarly, if \(Y\) is distributed according to a geometric distribution with parameter \(q\), we write the following.

\[Y \sim \operatorname{Geometric}(q)\]

Another standard notation is to use \(\Pr(X=x)\) to refer to the probability that random variable \(X\) takes on value \(x\).

\[\Pr(X=\texttt{true}) = \theta\] \[\Pr(Y=5) = q \cdot (1-q)^5\]

Crucially, \(\Pr(\cdot)\) here should be thought of as a higher-order function which takes a logical predicate \(X=\texttt{true}\) or \(Y=5\) and returns the probability that that predicate is true. Often lower case \(p\) is used to refer to the probability mass functions associated with random variables, i.e.,

\[\Pr(X=\texttt{true}) = p_X(\texttt{true}) = \theta\] \[\Pr(Y=5) = p_Y(5) = q \cdot (1-q)^5\]

Sometimes we will write \(\Pr(X) = \{\Pr(X=x) \mid x \in \text{outcomes}~\text{of } X\}\) to refer to the whole distribution associated with random variable \(X\). Also, when it is clear from context which random variable we are referring to, we will sometimes write \(\Pr(x)\) instead of the explicit \(\Pr(X=x)\) or \(p_X(x)\) for the probability of random variable \(X\)’s having outcome \(x\).

Joint Distributions

Above, we gave the example of complex random variables built out of other simpler random variables.

(def my-random-variable (list (flip 0.5) (flip 0.5)))
my-random-variable

Here we are defining a random variable which is a list of Bernoulli random variables. Complex random variables like this come up frequently in probability theory, and we will need some special terminology and notation to deal with them. A distribution defined over several component random variables is known as a joint distribution in probability theory.

Consider the following code.

(repeatedly 10 (fn [] (flip 0.5)))

The value produced by the expression (repeatedly 10 (fn [] (flip 0.5))) consists of the results of ten calls to flip. This list consists of ten component random variables; let’s use superscripts to index into the list, like so: \(X^{(1)}, X^{(2)}, X^{(3)}, \dots, X^{(10)}\). We can now talk about the joint distribution over these ten coin flips, which we write as follows.

\[\Pr(X^{(1)}, X^{(2)},X^{(3)},\dots,X^{(10)})\]

The joint distribution over a set of random variables is one of the most fundamental concepts in probabilistic modeling and we will return to it many times.

For discrete distributions, we can think of the sample space or support of the joint distribution as the Cartesian product of the sample spaces of all of the component random variables. So if \(\mathcal{B}\) is the set \(\{\texttt{true}, \texttt{false}\}\)—the support of each of the ten variables above—then the support of the joint distribution of all ten is \(\mathcal{B}^{10}\).
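To make this concrete, here is a small sketch (using Clojure’s for comprehension, which we have not otherwise introduced) that enumerates the support of the joint distribution of just two coin flips.

(for [x (list true false)
      y (list true false)]
  (list x y))

This produces the four outcomes of \(\mathcal{B}^2\); by the same reasoning, \(\mathcal{B}^{10}\) contains \(2^{10} = 1024\) outcomes.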

The Categorical Distribution

A categorical distribution is the name for a probability distribution over \(k\) discrete outcomes. It takes as a parameter a probability vector \(\vec{\theta}\) of length \(k\), specifying the probabilities of the \(k\) possible outcomes. As we saw in the last chapter, a probability vector \(\vec{\theta}\) is simply a vector whose elements are probabilities \(\theta_i \in [0,1]\) that sum to one, such as \(\langle 0.3, 0.2, 0.5 \rangle\). Of course, for this to make sense, we will have to choose an ordering over our \(k\) discrete outcomes so that each index \(i\) in the vector is associated with one of the possible outcomes.

If \(X\) is a random variable distributed as a categorical distribution with parameters \(\vec{\theta}\), we write.

\[X \sim \operatorname{Categorical}(\vec{\theta})\]

An example of a categorical distribution is a fair, six-sided die. Whenever you think of a categorical distribution over \(k\) outcomes, it is useful to think of it as a \(k\)-sided, biased die.
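For instance, a fair six-sided die is a categorical distribution with parameter vector \(\vec{\theta} = \langle \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6} \rangle\), while a biased die assigns unequal probabilities to the six faces.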

Sampling from a Categorical Distribution

How can we write a sampler for categorical distributions in our language? First, it is useful to introduce a function that normalizes a list of numbers.

(defn normalize [params]
  ;; divide each element by the total so that the results sum to 1
  (let [sum (apply + params)]
    (map (fn [x] (/ x sum)) params)))

(let [example-weights (list 2 1 1)]
  (normalize example-weights))

We have defined a function called normalize which takes a list of numbers and normalizes them, i.e., scales the elements so that they add to \(1\). This function uses two programming ideas that we haven’t seen much of yet in the course.

The first is the let-statement. Recall that the let-statement gives us a local variable binding and is used like this:

(let [var1 expression1
      var2 expression2
      ...  ...        ]
  body)

Here var1, etc., are symbols that are bound to the values of the corresponding expressions. These variables can then be used in the body of the let statement, whose overall value is the result of evaluating the body. The variables bound within a let statement are local—they are only available inside the let statement. Note that this is unlike variables bound with def, which are available globally.
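For instance, in the following small example the overall value of the let expression is \(3\), and x and y are not available outside of it.

(let [x 1
      y 2]
  (+ x y))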

The second is the function apply, which takes a function and a list of arguments to that function and applies the function to the list, like so: (apply + params) in the code above.
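For example, the following two expressions are equivalent.

(apply + (list 1 2 3))
; is the same as
(+ 1 2 3)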

Why do we need to use apply in the function definition?

Using these definitions we can define a sampler for categorical distributions.

(defn sample-categorical [outcomes probs]
  ;; flip a coin weighted by the probability of the first outcome
  (if (flip (first probs))
    (first outcomes)
    ;; otherwise recurse on the remaining outcomes,
    ;; renormalizing their probabilities
    (sample-categorical (rest outcomes)
                        (normalize (rest probs)))))

(sample-categorical
 (list 'Ishmael 'call 'me)
 (list 0.5 0.25 0.25))

Let’s have a look at the resulting distribution.

(hist 
 (repeatedly 
  10000 
  (fn [] 
    (sample-categorical
     (list 'Ishmael 'call 'me)
     (list 0.5 0.25 0.25)))))

The idea behind this function is similar to the idea behind the geometric distribution that we saw earlier. We walk through each of the elements of the list of possible values \(v_i\), flipping a coin to see if we should stop there. That is, sample-categorical flips a coin weighted by the probability of the first outcome; if the coin comes up true, it returns that outcome. Otherwise, it recurses on the rest of the outcomes; critically, before doing so, it renormalizes the remaining probabilities in (rest probs) to make them a proper probability vector.

There are two ways to understand why we have to renormalize the remaining probabilities before each call. The first is to recognize that the categorical sampler always needs a properly normalized distribution that sums to \(1\). Thus, to satisfy the contract, we must renormalize the remaining probabilities before passing them in. A second way to understand this, which we will explore in a later unit, is that once we have decided not to return some value, we need to sample from the conditional distribution conditioned on not choosing that value. We will see what this means later.
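To see this concretely with the probability vector \(\langle 0.5, 0.25, 0.25 \rangle\) from above: if the first flip comes up false, the remaining probabilities \(\langle 0.25, 0.25 \rangle\) sum to only \(0.5\), so we renormalize them.

(normalize (list 0.25 0.25))

The result, \(\langle 0.5, 0.5 \rangle\), is exactly the conditional distribution over the remaining two outcomes, given that the first outcome was not chosen.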

Scoring a Categorical Distribution

Now that we have seen how to sample from a categorical distribution, how can we score a sample from one? In some sense, scoring a categorical distribution is trivial: if we observe an instance of outcome \(i\) in the support of the categorical, its probability is just \(\theta_i\), that is, the probability of outcome \(i\) in the parameter vector of the distribution.

(def vocabulary (list 'call 'me 'Ishmael))
(def probabilities (list (/ 1 3) (/ 1 3) (/ 1 3)))

(defn score-categorical [outcome outcomes probs]
  (if (empty? probs)
    (throw "no matching outcome")
    (if (= outcome (first outcomes))
      ;; found the outcome: return its probability theta_i
      (first probs)
      ;; otherwise keep searching the remaining outcomes
      (score-categorical outcome (rest outcomes) (rest probs)))))

(score-categorical 'me vocabulary probabilities)

Log Probabilities

When working with probability distributions in practice, we often need to compute products with many terms. Since probabilities are numbers between \(0\) and \(1\)—in fact, often very close to \(0\)—we can quickly get underflow errors. As numbers get very close to zero, they can drop below the level of precision provided by standard floating point representations. For this reason, we typically work with log probabilities.

Log probabilities are simply the logarithms of probabilities. In this course (and in probabilistic modeling in general), we will usually use the logarithm with base \(2\). Recall that the logarithm \(\log_2(x)\) is the number of times that you would need to multiply \(2\) by itself to get \(x\). Since probabilities are numbers between \(0\) and \(1\), we will need to raise \(2\) to a negative power to “reach” the target probability \(x\)—that is, we seek a \(q\) such that \(x=2^{-q}\) or equivalently \(x=(\frac{1}{2})^q\). Thus, log probabilities are numbers between \(-\infty\), that is, \(\log(0)\), and \(0\), that is, \(\log(1)\).

Logarithms are useful because they give us a new way of representing numbers—in terms of exponentiation of a base. Instead of thinking about a number \(x\) in terms of the number of times we need to add \(1\), \(10\), \(100\), etc. to reach \(x\), we think about the number of times we have to multiply \(2\) by itself to reach \(x\). Because of this, addition in the log domain corresponds to multiplication of probabilities.

\[\log(p_1 \times p_2 \times p_3) = \log(p_1) + \log(p_2) + \log(p_3)\]
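For example, \(\log_2(0.5 \times 0.25) = \log_2(0.125) = -3\), which is the same as \(\log_2(0.5) + \log_2(0.25) = -1 + (-2) = -3\).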

Since we are frequently computing products in probability theory, it is often useful to work in the log domain and compute sums instead.

Since the logarithms of very small probabilities (i.e., probabilities close to \(0\)) are negative numbers with very large absolute values, they help us to avoid underflow errors.
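As a quick (hypothetical) illustration using the same Math interop as the code below: the probability \(2^{-1100}\) cannot be represented as a double-precision float, but its log probability, \(-1100\), is a perfectly ordinary number.

(Math/pow 2 -1100)
; => 0 -- the probability underflows to zero,
; but its log probability, -1100, is representable without trouble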

There is one disadvantage, however: to add probabilities that we have represented in log space, we have no choice but to exponentiate first, then add, then take the logarithm again. We usually define a function called logsumexp to do this.

(defn log2 [x] (Math/log2 x))
(defn exp2 [z] (Math/pow 2 z))

(defn logsumexp [& log-vals]
  ;; exponentiate, sum, then take the log of the result
  (log2 
   (apply 
    + 
    (map exp2 log-vals))))

(logsumexp (log2 0.5) (log2 0.5))


In this example, we have used a new Clojure notation: the & in the function definition. This notation tells Clojure that it should take all of the arguments passed into a call like (logsumexp -10 -12), make them into a single list, and bind this list to the variable called log-vals.

((fn [& args] ... ) 1 2 3)
; is the same as
((fn [args]   ... ) (list 1 2 3))

The function logsumexp first exponentiates all of its arguments, turning them back into ordinary probabilities, then adds them, and returns the logarithm of the result. Of course, if any of our log probabilities were very large in absolute value, we might still get underflow errors. One way of dealing with this is to implement a “safe” version of logsumexp as in the code listing below.

(defn exp2 [z] (Math/pow 2 z))

(defn logsumexp [& log-vals]
  (let [mx (apply max log-vals)]
    (+ mx
       (log2
        (apply 
         +
         (map exp2 
              (map (fn [x] (- x mx))
                   log-vals)))))))

(logsumexp -1 -1)

Can you tell what is being done here to avoid underflow?

