10 Discrete Random Variables
;; A fair coin: true with probability 0.5.
(defn flip []
  (if (< (rand) 0.5)
    true
    false))

;; Does the list pr appear as a prefix of the list s?
(defn prefix? [pr s]
  (if (> (count pr) (count s))
    false
    (if (empty? pr)
      true
      (if (= (first pr) (first s))
        (prefix? (rest pr) (rest s))
        false))))

;; Sampler for the distribution over strings of the form (ab)^n.
(defn sample-ab* []
  (if (flip)
    '()
    (concat '(a b) (sample-ab*))))

;; Scorer for the same distribution: the probability of string s.
(defn score-ab* [s]
  (if (empty? s)
    0.5
    (if (prefix? '(a b) s)
      (* 0.5 (score-ab* (rest (rest s))))
      0)))
We have now seen our first probability distributions over strings. We have in fact seen two different representations of the distributions—samplers and scorers. In general, there are many different kinds of probability distributions. In this part of the course, we will focus on discrete distributions, that is, probability distributions whose support consists of discrete elements like integers or symbols. As we have seen, in this setting the score function \(p(s)\) is called a probability mass function.

We can think of a discrete probability distribution as a pair \(\langle S, p \rangle\) consisting of a nonempty set \(S\), called the sample space, or set of outcomes, together with a probability mass function \(p\) which assigns a probability to each element of \(S\), such that \(p(s) \in [0,1]\) for all \(s\) in \(S\) and \(\sum_{s \in S} p(s) = 1\). The subset of outcomes which the probability mass function maps to nonzero values is called the support of the distribution. Note that these definitions will not be sufficient if our sample space is continuous (uncountable); in that case we would need to make use of measure theory to give a well-defined notion of probability distribution. However, we won't consider these complications in this course.
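To make the pair \(\langle S, p \rangle\) concrete, here is a minimal sketch of a fair six-sided die represented as a sample space together with a probability mass function. (The names die-outcomes and die-pmf are ours, for illustration only.)

;; Sample space: the six faces of a fair die.
(def die-outcomes (list 1 2 3 4 5 6))

;; The probability mass function assigns 1/6 to each face and 0 to
;; anything outside the sample space.
(defn die-pmf [s]
  (if (some (fn [x] (= x s)) die-outcomes)
    (/ 1 6)
    0))

;; The probabilities of the outcomes must sum to 1.
(+ (die-pmf 1) (die-pmf 2) (die-pmf 3)
   (die-pmf 4) (die-pmf 5) (die-pmf 6))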
Bernoulli Random Variables
One of the most basic probability distributions is the Bernoulli distribution, which can be thought of as representing a biased coin and is expressed by (flip weight), where weight is the probability of heads (or true). It is useful to introduce some additional terminology and notation at this point. First, a random variable is a mathematical object that takes on a value from some sample space \(S\) according to some probability mass function \(p\). For instance, some particular invocation of flip constitutes a random variable.
(defn flip [weight]
  (if (< (rand) weight)
    true
    false))

(def my-random-variable (flip 0.1))

my-random-variable
A random variable refers to a particular call of flip, and the value returned is the outcome of the random variable. While the procedure flip defines a distribution, the invocation of this procedure in (def my-random-variable (flip 0.1)) defines a particular random variable. In general, we will use the term random variable to refer to the use of a distribution in a particular place in a program, while the distribution will refer to the abstract set of values together with their probabilities. Thus, in the following code sample, there are three random variables, all with the same distribution (a fair coin).
(list
 (flip 0.5)
 (flip 0.5)
 (flip 0.5))
In general, once we have added random functions (like flip) to our programming language, any expression can be a random variable. Random variables can thus be more complex objects built from simpler parts, including other random variables.
(def my-random-variable
  (list
   (flip 0.5)
   (flip 0.75)
   (flip 0.5)))

my-random-variable
Here the random variable of interest is a list created by putting together three invocations of flip (each with a specified weight). Each particular invocation of flip in this list is also a random variable. We will think of any expression built up of samplers and deterministic (non-random) functions as a random variable, where the outcome or value sampled is the result of evaluating that expression. Most of the objects we will be interested in during this course will be complex random variables built up from simpler ones like this.
Probability Notation
Since probability is a tool used across mathematics, engineering, and the sciences, there is an array of different notations and conventions used in different contexts. Here we introduce some of these.
When \(X\) is a random variable distributed according to some known distribution, such as Bernoulli, we often write the following.
\[X \sim \operatorname{Bernoulli}(\theta)\]

This is read "\(X\) is distributed as a Bernoulli distribution with parameter (i.e., weight) \(\theta\)." It is also sometimes read "\(X\) is sampled from a Bernoulli distribution with parameter \(\theta\)."
Or similarly, if \(Y\) is distributed according to a geometric distribution with parameter \(q\), we write

\[Y \sim \operatorname{Geometric}(q)\]

Another standard notation is to use \(\Pr(X=x)\) to refer to the probability that random variable \(X\) takes on value \(x\).

\[\Pr(X=\texttt{true}) = \theta\]
\[\Pr(Y=5) = q \cdot (1-q)^5\]

Crucially, \(\Pr(\cdot)\) here should be thought of as a higher-order function which takes a logical predicate such as \(X=\texttt{true}\) or \(Y=5\) and returns the probability that the predicate is true. Often, lowercase \(p\) is used to refer to the probability mass functions associated with random variables, i.e.,

\[\Pr(X=\texttt{true}) = p_X(\texttt{true}) = \theta\]
\[\Pr(Y=5) = p_Y(5) = q \cdot (1-q)^5\]

Sometimes we will write \(\Pr(X) = \{\Pr(X=x) \mid x \in \text{outcomes of } X\}\) to refer to the whole distribution associated with random variable \(X\). Also, when it is clear from context which random variable we are referring to, we will sometimes write \(\Pr(x)\) instead of the explicit \(\Pr(X=x)\) or \(p_X(x)\) for the probability of random variable \(X\)'s having outcome \(x\).
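As a sanity check on this notation, here is a sketch of a scorer for the geometric distribution above, computing \(\Pr(Y=k) = q \cdot (1-q)^k\). (The name score-geometric is ours; it is not a course primitive.)

;; Probability that a geometric random variable with parameter q
;; takes the value k: q * (1-q)^k.
(defn score-geometric [k q]
  (if (= k 0)
    q
    (* (- 1 q) (score-geometric (- k 1) q))))

(score-geometric 5 0.5)
;; => 0.015625, i.e., 0.5 * (1 - 0.5)^5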
Joint Distributions
Above, we gave the example of complex random variables built out of other simpler random variables.
(def my-random-variable (list (flip 0.5) (flip 0.5)))
my-random-variable
Here we are defining a random variable which is a list of Bernoulli random variables. Complex random variables like this come up frequently in probability theory and we will need some special terminology and notation to deal with them. A distribution defined over several component random variables is known as a joint distribution.
Consider the following code.
(repeatedly 10 (fn [] (flip 0.5)))
The value produced by the expression (repeatedly 10 (fn [] (flip 0.5))) consists of ten calls to flip. This list consists of several component random variables; let's use superscripts to index into the list, like so: \(X^{(1)}, X^{(2)}, X^{(3)}, \dots, X^{(10)}\). We can now talk about the joint distribution over these ten coin flips, which we write \(\Pr(X^{(1)}, X^{(2)}, \dots, X^{(10)})\).
The joint distribution over a set of random variables is one of the most fundamental concepts in probabilistic modeling and we will return to it many times.
For discrete distributions, we can think of the sample space or support of the joint distribution as the Cartesian product of the sample spaces of all of the component random variables. So if \(\mathcal{B}\) is the set \(\{\texttt{true}, \texttt{false}\}\)—the support of each of the ten variables above—then the support of the joint distribution of all ten is \(\mathcal{B}^{10}\).
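To see what this product space looks like at a small scale, here is a sketch enumerating \(\mathcal{B}^2\), the support of a joint distribution over two boolean random variables. (It uses Clojure's for comprehension, which we have not otherwise relied on in the course.)

;; Enumerate the Cartesian product B x B.
(for [x (list true false)
      y (list true false)]
  (list x y))
;; => ((true true) (true false) (false true) (false false))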
The Categorical Distribution
A categorical distribution is the name for a probability distribution over \(k\) discrete outcomes. It takes as a parameter a probability vector \(\vec{\theta}\) of length \(k\), specifying the probabilities of the \(k\) possible outcomes. As we saw in the last chapter, a probability vector \(\vec{\theta}\) is simply a vector whose elements are probabilities \(\theta_i \in [0,1]\) that sum to one, such as \(\langle 0.3, 0.2, 0.5 \rangle\). Of course, for this to make sense, we will have to choose an ordering over our \(k\) discrete outcomes so that each index \(i\) in the vector is associated with one of the possible outcomes.

If \(X\) is a random variable distributed as a categorical distribution with parameter vector \(\vec{\theta}\), we write

\[X \sim \operatorname{Categorical}(\vec{\theta})\]

An example of a categorical distribution is a fair, six-sided die. Whenever you think of a categorical distribution over \(k\) outcomes, it is useful to think of it as a \(k\)-sided, biased die.
Sampling from a Categorical Distribution
How can we write a sampler for categorical distributions in our language? First, it is useful to introduce a function that normalizes a list of numbers.
(defn normalize [params]
  (let [sum (apply + params)]
    (map (fn [x] (/ x sum)) params)))

(let [example-weights (list 2 1 1)]
  (normalize example-weights))
We have defined a function called normalize which takes a list of numbers and normalizes them, i.e., scales the elements so that they add to \(1\). This function uses two programming ideas that we haven't made much use of yet in the course.
The first is the let statement. Recall that the let statement gives us local variable bindings and is used like this:

(let [var1 expression1
      var2 expression2
      ...  ...]
  body)
Here var1, etc., are symbols that are bound to the values of the corresponding expressions. These variables can then be used in the body of the let statement, whose overall value is the result of evaluating the body. The variables bound within a let statement are local: they are only available inside the let statement. Note that this is unlike variables bound with def, which are available globally.
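For instance (an illustrative example):

;; x and y are bound only inside the body of the let.
(let [x 2
      y 3]
  (* x y))
;; => 6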
The second is the function apply, which takes a function and a list of arguments for that function and applies the function to the list, as in (apply + params) in the code above.
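For example (illustrative):

;; apply unrolls the list into individual arguments, so
(apply + (list 1 2 3))
;; is equivalent to
(+ 1 2 3)
;; => 6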
Why do we need to use apply in the function definition?
Using these definitions we can define a sampler for categorical distributions.
(defn sample-categorical [outcomes probs]
  (if (flip (first probs))
    (first outcomes)
    (sample-categorical (rest outcomes)
                        (normalize (rest probs)))))

(sample-categorical
 (list 'Ishmael 'call 'me)
 (list 0.5 0.25 0.25))
Let’s have a look at the resulting distribution.
(hist
 (repeatedly
  10000
  (fn []
    (sample-categorical
     (list 'Ishmael 'call 'me)
     (list 0.5 0.25 0.25)))))
The idea behind this function is similar to the idea behind the geometric distribution that we saw earlier. We walk through each of the elements of the list of possible values \(v_i\), flipping a coin to see if we should stop there. So the function sample-categorical flips a coin with the weight of the first object to see if it should return that first object. If the coin comes up true, it returns that object. Otherwise, it recurses on the rest of the objects; crucially, before doing so it renormalizes the remaining probabilities in (rest probs) to make them a proper probability vector again.
There are two ways to understand why we have to renormalize the remaining probabilities before each call. The first is to recognize that the categorical sampler always needs a properly normalized distribution that sums to \(1\). Thus, to satisfy the contract, we must renormalize the remaining probabilities before passing them in. A second way to understand this, which we will explore in a later unit, is that once we have decided not to return some value, we need to sample from the conditional distribution conditioned on not choosing that value. We will see what this means later.
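Concretely, if we start with the probabilities (0.5 0.25 0.25) and decide not to return the first outcome, the recursive call receives the renormalized remainder:

;; After rejecting the first outcome, the remaining probabilities
;; (0.25 0.25) are rescaled so that they sum to 1 again.
(normalize (list 0.25 0.25))
;; => (0.5 0.5)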
Scoring a Categorical Distribution
Now that we have seen how to sample from a categorical distribution, how can we score a sample from one? In some sense, scoring a categorical distribution is trivial: if we see an observation which is an instance of outcome \(i\) in the support of the categorical, its probability is just \(\theta_i\), that is, the probability of outcome \(i\) in the parameter vector of the distribution.
(def vocabulary (list 'call 'me 'Ishmael))

(def probabilities (list (/ 1 3) (/ 1 3) (/ 1 3)))

(defn score-categorical [outcome outcomes probs]
  (if (empty? probs)
    (throw "no matching outcome")
    (if (= outcome (first outcomes))
      (first probs)
      (score-categorical outcome (rest outcomes) (rest probs)))))

(score-categorical 'me vocabulary probabilities)
Log Probabilities
When working with probability distributions in practice, we often need to compute products with many terms. Since probabilities are numbers between \(0\) and \(1\)—in fact, often very close to \(0\)—we can quickly get underflow errors. As numbers get very close to zero, they can drop below the level of precision provided by standard floating point representations. For this reason, we typically work with log probabilities.
Log probabilities are simply the logarithms of probabilities. In this course (and in probabilistic modeling in general), we will usually use the logarithm with base \(2\). Recall that the logarithm \(\log_2(x)\) is the number of times that you would need to multiply \(2\) by itself to get \(x\). Since probabilities are numbers between \(0\) and \(1\), we will need to raise \(2\) to a negative power to "reach" the target probability \(x\)—that is, we seek a \(q\) such that \(x=2^{-q}\), or equivalently \(x=(\frac{1}{2})^q\). Thus, log probabilities are numbers between \(-\infty\), that is, \(\log(0)\), and \(0\), that is, \(\log(1)\).
Logarithms are useful because they give us a new way of representing numbers—in terms of exponentiation of a base. Instead of thinking about a number \(x\) in terms of the number of times we need to add \(1\), \(10\), \(100\), etc. to reach \(x\), we think about the number of times we have to multiply \(2\) by itself to reach \(x\). Because of this, addition in the log domain corresponds to multiplication of probabilities.
\[\log(p_1 \times p_2 \times p_3) = \log(p_1) + \log(p_2) + \log(p_3)\]

Since we are frequently computing products in probability theory, it is often useful to work in the log domain and compute sums instead.
Since the logarithms of very small probabilities (i.e., probabilities close to \(0\)) are negative numbers with very large absolute values, they help us to avoid underflow errors.
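For example, the product of many coin-flip probabilities underflows, while the corresponding log probability remains easy to represent (a sketch; the exact printed value depends on the host's floating-point arithmetic):

;; The true value, 2^-10000, is far below the smallest positive
;; double, so the product underflows to 0:
(Math/pow 0.5 10000)
;; => 0

;; In the log domain the same quantity is just
;; 10000 * log2(0.5) = -10000, which is easy to represent.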
There is one disadvantage, however: to add probabilities that are represented in log space, we have no choice but to exponentiate them first, then add, and then take the logarithm again. We usually define a function called logsumexp to do this.
(defn log2 [x] (Math/log2 x))

(defn exp2 [z] (Math/pow 2 z))

(defn logsumexp [& log-vals]
  (log2
   (apply +
          (map exp2 log-vals))))

(logsumexp (log2 0.5) (log2 0.5))
In this example, we have used a new piece of Clojure notation: the & in the function definition. This notation tells Clojure that it should take all of the arguments passed in a call like (logsumexp -10 -12), make them into a single list, and bind this list to the variable called log-vals.
((fn [& args] ... ) 1 2 3)
; is the same as
((fn [args] ... ) (list 1 2 3))
The function logsumexp first exponentiates all of its arguments, turning them back into ordinary probabilities, then adds them, and returns the logarithm of the result. Of course, if any of our log probabilities are very large in absolute value, we might still get underflow errors. One way of dealing with this is to implement a "safe" version of logsumexp, as in the code listing below.
(defn exp2 [z] (Math/pow 2 z))

(defn logsumexp [& log-vals]
  (let [mx (apply max log-vals)]
    (+ mx
       (log2
        (apply +
               (map exp2
                    (map (fn [x] (- x mx))
                         log-vals)))))))

(logsumexp -1 -1)
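As an illustration of the difference (exact printed values depend on the host's floating-point arithmetic):

;; The earlier, naive logsumexp returns negative infinity on this
;; input; the safe version returns the exact answer.
(logsumexp -10000 -10000)
;; => -9999, since 2^-10000 + 2^-10000 = 2^-9999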
Can you tell what is being done here to avoid underflow?