What sorts of formal languages can a bag-of-words model generate in principle? It is fairly easy to see that if the categorical distribution in one of our bag-of-words models places probability mass on all of the elements of the vocabulary \(V\), then the model places mass on all of the elements of \(V^*\). Thus, if a BOW model can generate a sequence like colorless green ideas sleep furiously, then it can also generate every permutation of that sequence, including ungrammatical permutations like green sleep colorless ideas furiously.

This fact is a simple consequence of the independence assumptions made by the bag-of-words model. To fully understand the notion of independence, we will need to discuss several important concepts of probability theory in more depth. In this unit, we do so: we examine joint distributions more closely, introduce marginal distributions, conditional distributions, and the chain rule, and discuss how each of these relates to the concept of statistical independence.

Joint Distributions

The distribution defined by a bag-of-words sampler for a corpus places probability mass on compound or complex structures: sets of strings, each of which is a sequence of words.

Let’s call the random variable representing the \(i\)th word of the \(j\)th sentence of a corpus \(W^{(j,i)}\). As discussed earlier, we often wish to think about the joint distribution over a set of random variables such as those representing the words in a corpus:

\[\begin{aligned} \Pr(&W^{(1,1)}, W^{(1,2)}, \dots &&W^{(2,1)}, W^{(2,2)} \dots &&W^{(|\mathbf{C}|,|s_{|\mathbf{C}|}|)} =& \\ &w^{(1,1)}, w^{(1,2)}, \dots &&w^{(2,1)}, w^{(2,2)} \dots &&w^{(|\mathbf{C}|,|s_{|\mathbf{C}|}|)} &) \end{aligned}\]

The joint distribution over a set of random variables is typically the most fundamental object defined by a particular probabilistic model, and we will return to it many times.

What exactly is a joint distribution? Recall that we have defined a distribution as consisting of a sample space together with a probability mass function. Thus, we need two ingredients to specify our joint distribution. First, we need to specify our sample space or support. The support of the joint distribution \(\Pr(X_1, \dots, X_n)\) over discrete random variables is the Cartesian product of the supports of the individual variables. In other words, it is the set of tuples of the form \(\langle X_1=x_1, \dots, X_n=x_n \rangle\) (note that we are treating strings as tuples here, for the purposes of describing joint distributions over corpora).

What is the probability mass function of the joint distribution? In general, the probability mass function for a joint distribution over discrete random variables needs to specify a probability for every possible combination of values of the random variables (minus one, since the probabilities must sum to \(1\)). For example, let’s assume that our vocabulary is defined over the set \(V=\{\textit{Call}, \textit{me}, \textit{Ishmael}\}\) and that we wish to consider the joint distribution over strings with two words in them.

\[\Pr(W^{(1)}, W^{(2)} = w^{(1)}, w^{(2)})\]

With vocabulary \(V\), there are nine possible strings of length two. Thus, fully specifying the probability mass function \(p\) can require up to eight separate probabilities (nine minus one, since the final probability is determined by the requirement that they sum to \(1\)).

It is often convenient to visualize joint distributions (or, more generally, sets of tuples) as a table with one dimension for each random variable (or coordinate of the tuple).

| \({}_{\Large\mathbf{W}^{(1)}}\diagdown {}^{\Large\mathbf{W}^{(2)}}\) | Call | me | Ishmael |
|---|---|---|---|
| Call | \(p(\textit{Call}, \textit{Call})\) | \(p(\textit{Call}, \textit{me})\) | \(p(\textit{Call}, \textit{Ishmael})\) |
| me | \(p(\textit{me}, \textit{Call})\) | \(p(\textit{me}, \textit{me})\) | \(p(\textit{me}, \textit{Ishmael})\) |
| Ishmael | \(p(\textit{Ishmael}, \textit{Call})\) | \(p(\textit{Ishmael}, \textit{me})\) | \(p(\textit{Ishmael}, \textit{Ishmael})\) |

Notice that the complexity of a joint distribution increases quickly as you add values to each random variable (i.e., vocabulary items/rows and columns in the table) and even more quickly as you add random variables (i.e., words in the corpus/dimensions in the table). For instance, with a vocabulary of size four, the joint distribution over two words \(\Pr(W^{(1)},W^{(2)})\) can assign a different probability to every single one of the \(16\) possible strings. The joint distribution over three words \(\Pr(W^{(1)},W^{(2)}, W^{(3)})\) with a vocabulary of size \(4\) can assign a different probability to each of the \(64\) possible strings, and so on.
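To get a concrete feel for this growth, here is a minimal Python sketch (Python and the fourth vocabulary item are illustrative assumptions, not part of the text) that enumerates the support of the joint distribution for increasing numbers of words:

```python
from itertools import product

# Hypothetical vocabulary of size four (the fourth word is made up for illustration).
V = ["Call", "me", "Ishmael", "whale"]

# The support of the joint distribution over n words is the set of all
# length-n tuples over V, so its size grows as |V| ** n.
for n in range(1, 5):
    support = list(product(V, repeat=n))
    print(n, len(support))  # prints: 1 4, 2 16, 3 64, 4 256
```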

Joint distributions are typically considered the fundamental object of interest in a probabilistic model since they encode the maximum possible amount of information about interactions between random variables. Joint distributions can be very complex and can require many parameters to specify completely. Because of this, practical probabilistic models almost always make simplifying assumptions that make the representation of the joint distribution less complex, and prevent us from having to specify a different probability for every value of every combination of random variables in the table. In particular, models make conditional independence assumptions about random variables. For example, in the case of the BOW model, the probability mass function is given by the product of the probabilities of each individual word.

\[\Pr(W^{(1,1)}, W^{(1,2)}, \dots) = \Pr(W^{(1,1)}) \cdot \Pr(W^{(1,2)}) \cdot \dots\]
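As a sketch of what this factorization looks like in code, the following Python snippet scores a sentence under a bag-of-words model by multiplying the per-word categorical probabilities (the function name and parameter values here are illustrative assumptions, not part of the text):

```python
import math

# Hypothetical categorical parameters over V = {Call, me, Ishmael}.
theta = {"Call": 0.5, "me": 0.25, "Ishmael": 0.25}

def bow_probability(sentence, theta):
    """Probability of a word sequence under a bag-of-words model:
    the product of the independent per-word probabilities."""
    return math.prod(theta[w] for w in sentence)

print(bow_probability(["Call", "me", "Ishmael"], theta))  # 0.5 * 0.25 * 0.25 = 0.03125
```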

The words in a bag-of-words model are statistically independent. Statistical independence is a fundamental concept in probabilistic modeling, and we will now consider it in more detail.

Independence

Intuitively, independence means that the values that one random variable takes on don’t affect our beliefs about the values that another random variable takes on. When sampling one of the random variables, we don’t need to know the value the other one has taken on. Mathematically, this idea is captured in probability theory by the following rule: If \(A\) and \(B\) are independent random variables, then:

\(\Pr(A=a,B=b)=\Pr(A=a) \cdot \Pr(B=b)\) for all possible values of \(a\) and \(b\).

Assuming a bag-of-words model, we can use this rule to calculate the probability of each point or state in the joint distribution displayed in the table above. Let’s assume that our categorical distribution is again defined over the set \(\{\textit{Call}, \textit{me}, \textit{Ishmael}\}\), but this time it is not uniform; instead it has the parameters \(\langle \frac{1}{2}, \frac{1}{4},\frac{1}{4} \rangle\). Our sample or outcome space is the space of length-\(2\) sequences of words, such as \(\langle \textit{Call}\ \textit{me} \rangle\). We can now fill in the cells of the table representing the joint distribution

\[\Pr(W^{(1)}, W^{(2)} = w^{(1)}, w^{(2)})\]

With the independence assumptions of a bag-of-words model, this joint distribution becomes the following.

| \({}_{\Large\mathbf{W}^{(1)}}\diagdown {}^{\Large\mathbf{W}^{(2)}}\) | Call | me | Ishmael |
|---|---|---|---|
| Call | \(\frac12\cdot\frac12=\frac14\) | \(\frac12\cdot\frac14=\frac18\) | \(\frac12\cdot\frac14=\frac18\) |
| me | \(\frac14\cdot\frac12=\frac18\) | \(\frac14\cdot\frac14=\frac1{16}\) | \(\frac14\cdot\frac14=\frac1{16}\) |
| Ishmael | \(\frac14\cdot\frac12=\frac18\) | \(\frac14\cdot\frac14=\frac1{16}\) | \(\frac14\cdot\frac14=\frac1{16}\) |
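The table can also be reconstructed programmatically. The sketch below (in Python, with illustrative names) builds every cell as a product of the categorical parameters \(\langle \frac12, \frac14, \frac14 \rangle\) and checks that the result is a proper probability distribution:

```python
from fractions import Fraction
from itertools import product

# Categorical parameters <1/2, 1/4, 1/4> over {Call, me, Ishmael}, as in the text.
theta = {"Call": Fraction(1, 2), "me": Fraction(1, 4), "Ishmael": Fraction(1, 4)}

# Under the bag-of-words independence assumption, each cell of the joint table
# is the product of the two per-word probabilities.
joint = {(w1, w2): theta[w1] * theta[w2] for w1, w2 in product(theta, repeat=2)}

print(joint[("Call", "me")])  # 1/8, matching the table above
print(sum(joint.values()))    # 1, so the cells form a proper distribution
```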

Our law of independence says that \(\Pr(W^{(1,1)}, W^{(1,2)}, \dots) = \Pr(W^{(1,1)}) \cdot \Pr(W^{(1,2)}) \cdot \dots\). In the bag-of-words model, \(\Pr(W^{(1,1)})\) and \(\Pr(W^{(1,2)})\) are specified using a categorical distribution over the vocabulary. But in general, what is the relationship between a joint distribution \(\Pr(A,B)\) and the distributions over its component random variables, \(\Pr(A)\) and \(\Pr(B)\)? These are known as marginal distributions, and the laws of probability tell us how they must be systematically related to the joint.

Marginal Distributions

Marginal distributions like \(\Pr(A)\) and \(\Pr(B)\) are related to joint distribution \(\Pr(A,B)\) by one of the most important operations of probability theory: marginalization. Marginalization refers to the process of summing over all of the values of some of the random variables in a joint distribution.

Consider \(\Pr(W^{(2)}=\textit{Ishmael})\), the probability that the second word takes on the value Ishmael. What is this probability, according to the joint distribution \(\Pr(W^{(1)},W^{(2)})\)? One way we can think about this is that in the joint space \(\langle W^{(1)}, W^{(2)} \rangle\), there are three points or states where \(W^{(2)}=\textit{Ishmael}\):

  1. \(\langle W^{(1)}=\textit{Call}, W^{(2)}=\textit{Ishmael} \rangle\),
  2. \(\langle W^{(1)}=\textit{me}, W^{(2)}=\textit{Ishmael} \rangle\), and
  3. \(\langle W^{(1)}=\textit{Ishmael}, W^{(2)}=\textit{Ishmael} \rangle\).

So the total or marginal probability of \(W^{(2)}=\textit{Ishmael}\) is

\[\sum_{w \in W^{(1)}} \Pr(W^{(1)}=w, W^{(2)}=\textit{Ishmael})\]
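Plugging in the joint probabilities from the table above, this sum works out to

\[\Pr(W^{(2)}=\textit{Ishmael}) = \frac18 + \frac1{16} + \frac1{16} = \frac14\]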

In other words, the marginal probability adds up all of the probabilities in the joint distribution where \(W^{(2)}=\textit{Ishmael}\) without regard to the value of \(W^{(1)}\). We say that we marginalize over \(W^{(1)}\) or we marginalize \(W^{(1)}\) away. Marginalization is a way of getting rid of random variables in a joint distribution. The set of all marginal probabilities for values of \(W^{(2)}\) is called the marginal distribution over \(W^{(2)}\).

Why is this called a marginal distribution? If we look at the table above, the marginal probabilities of the two random variables can be written in the margins of the table.

| \({}_{\Large\mathbf{W}^{(1)}}\diagdown {}^{\Large\mathbf{W}^{(2)}}\) | Call | me | Ishmael | \(\Pr(\mathbf{W}^{(1)})\) |
|---|---|---|---|---|
| Call | \(\frac14\) | \(\frac18\) | \(\frac18\) | \(\frac12\) |
| me | \(\frac18\) | \(\frac1{16}\) | \(\frac1{16}\) | \(\frac14\) |
| Ishmael | \(\frac18\) | \(\frac1{16}\) | \(\frac1{16}\) | \(\frac14\) |
| \(\Pr(\mathbf{W}^{(2)})\) | \(\frac12\) | \(\frac14\) | \(\frac14\) |  |

Thus, if we have access to the joint distribution in our model (e.g., \(\Pr(A,B)\)) and we want to implement a scorer for some marginal distribution (e.g., \(\Pr(A)\)), we must sum over all of the joint states that contain the outcome whose probability we want to compute. If we have access to the joint distribution and we want to implement a sampler for the marginal, however, it is much easier: when we draw a sample, we simply “forget” the values of the random variables that we are marginalizing away.
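Here is a minimal Python sketch of both strategies for the two-word joint distribution above (the function names are illustrative, not from the text): the scorer sums over joint states, while the sampler draws a full joint state and keeps only the coordinate of interest.

```python
import random
from fractions import Fraction
from itertools import product

# The bag-of-words joint distribution over two words from the table above.
theta = {"Call": Fraction(1, 2), "me": Fraction(1, 4), "Ishmael": Fraction(1, 4)}
joint = {(w1, w2): theta[w1] * theta[w2] for w1, w2 in product(theta, repeat=2)}

def marginal_score(joint, position, value):
    """Scorer: sum the probability of every joint state whose `position`-th
    coordinate equals `value`, marginalizing the other coordinate away."""
    return sum(p for state, p in joint.items() if state[position] == value)

def marginal_sample(joint, position):
    """Sampler: draw a full joint state, then 'forget' every coordinate
    except the one we care about."""
    states = list(joint)
    weights = [float(joint[s]) for s in states]
    return random.choices(states, weights=weights)[0][position]

print(marginal_score(joint, 1, "Ishmael"))  # 1/4
print(marginal_sample(joint, 1))            # e.g. 'Call'
```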

Conditional Distributions

We have introduced joint distributions such as \(\Pr(A,B)\), marginal distributions such as \(\Pr(A)\) and \(\Pr(B)\), and the operation that relates them: marginalization. We now see that our definition of statistical independence is a law relating joint distributions and marginal distributions. Two random variables are statistically independent if their joint probabilities are always just the product of their marginal probabilities. Earlier we said that independence was meant to capture the intuitive idea that the two random variables did not “affect” one another. How does this law capture that?

To see this more clearly, it is useful to discuss another fundamental concept from probability theory, the conditional distribution, together with the operation that relates it to the joint distribution: conditioning. We have already seen one conditional probability a number of times in this course:

\[\Pr(\mathcal{C}=\mathbf{C} \mid \Theta= \vec{\theta})\]

This is the conditional probability of a corpus \(\mathbf{C}\) being generated by a model with parameter vector \(\vec{\theta}\). In this case, the parameter vector \(\vec{\theta}\) is fixed; however, we can also think of conditioning as a more general operation between random variables.

Conditioning can be thought of as a form of hypothetical reasoning where we ask about the probability distribution over some (set of) random variables, assuming some other random variables take on some particular values. For instance, we might ask what the probability distribution is over \(W^{(2)}\) given that \(W^{(1)}=\textit{Call}\). We write a conditional probability as \(\Pr(X=x \mid Y=y)\). The expression to the right of the conditioning bar is called the conditioner and can be any predicate that we require to be true.

For discrete random variables, conditioning is defined as follows:

\[\newcommand{\coloneqq}{\mathrel{:=}} \Pr(X=x \mid Y=y) \coloneqq \frac{\Pr(X=x,Y=y)}{\Pr(Y=y)} = \frac{\Pr(X=x,Y=y)}{\sum_{x^{\prime} \in X} \Pr(X=x^{\prime},Y=y)}\]

We can think of conditioning as a two step process. First, we get the subset of joint states where the conditioner is true. Or, equivalently, we remove all of the states where the conditioner is false. We then renormalize the joint probabilities of these states so that they are a probability distribution again by dividing through by the marginal probability of the conditioner. One can think of conditioning as zooming in on the part of the joint space we are interested in (the part where the conditioner is true) and then making the result a probability distribution by renormalizing so things add to \(1\) again.
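The two-step recipe translates directly into code. The following Python sketch (again with illustrative names) filters the joint states where the conditioner holds and then renormalizes, here answering the hypothetical question posed above about the distribution over \(W^{(2)}\) given \(W^{(1)}=\textit{Call}\):

```python
from fractions import Fraction
from itertools import product

theta = {"Call": Fraction(1, 2), "me": Fraction(1, 4), "Ishmael": Fraction(1, 4)}
joint = {(w1, w2): theta[w1] * theta[w2] for w1, w2 in product(theta, repeat=2)}

def condition(joint, predicate):
    """Conditioning in two steps: keep only the states where the conditioner
    (predicate) is true, then renormalize so the probabilities sum to 1 again."""
    kept = {state: p for state, p in joint.items() if predicate(state)}
    total = sum(kept.values())  # the marginal probability of the conditioner
    return {state: p / total for state, p in kept.items()}

# Pr(W2 | W1 = Call): condition on the first word being "Call".
for (_, w2), p in condition(joint, lambda s: s[0] == "Call").items():
    print(w2, p)  # Call 1/2, me 1/4, Ishmael 1/4
```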

Suppose that we want to ask about the conditional distribution \(\Pr(W^{(1)} \mid W^{(2)}=\textit{Ishmael})\). Let’s consider the joint distribution from above.

In this case, we only care about the column of the joint distribution where \(W^{(2)}=\textit{Ishmael}\).

\[\newcommand{\hlb}[1]{\bbox[Cornsilk,4px,border:2px solid LightGray]{#1}}\]
| \({}_{\Large\mathbf{W}^{(1)}}\diagdown {}^{\Large\mathbf{W}^{(2)}}\) | Call | me | Ishmael | \(\Pr(\mathbf{W}^{(1)})\) |
|---|---|---|---|---|
| Call | \(\frac14\) | \(\frac18\) | \(\hlb{\frac18}\) | \(\frac12\) |
| me | \(\frac18\) | \(\frac1{16}\) | \(\hlb{\frac1{16}}\) | \(\frac14\) |
| Ishmael | \(\frac18\) | \(\frac1{16}\) | \(\hlb{\frac1{16}}\) | \(\frac14\) |
| \(\Pr(\mathbf{W}^{(2)})\) | \(\frac12\) | \(\frac14\) | \(\frac14\) |  |

To make this into a conditional distribution we renormalize, dividing each value by the marginal, \(\Pr(W^{(2)}=\textit{Ishmael})=\frac14\).

| \(\mathbf{W}^{(1)}\) | \(\Pr(\mathbf{W}^{(1)}\mid \mathbf{W}^{(2)}=\textit{Ishmael})\) |
|---|---|
| Call | \(\hlb{\frac12}\) |
| me | \(\hlb{\frac14}\) |
| Ishmael | \(\hlb{\frac14}\) |
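For example, the first cell is obtained by dividing the highlighted joint probability by the marginal:

\[\Pr(W^{(1)}=\textit{Call} \mid W^{(2)}=\textit{Ishmael}) = \frac{\Pr(W^{(1)}=\textit{Call}, W^{(2)}=\textit{Ishmael})}{\Pr(W^{(2)}=\textit{Ishmael})} = \frac{1/8}{1/4} = \frac12\]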

Note that in this case an interesting thing has occurred: \(\Pr(W^{(1)} \mid W^{(2)}=\textit{Ishmael}) = \Pr(W^{(1)})\); that is, the conditional distribution is equal to the marginal distribution. Put another way, knowing the value of \(W^{(2)}\) doesn’t give us any new information that we didn’t already have in the marginal \(\Pr(W^{(1)})\). This is another way of defining independence, which can be expressed as follows.

\[\Pr(X=x \mid Y=y)=\Pr(X=x)\quad \forall x, y \text{ such that } \Pr(Y=y)>0\]

To see this, note that if \(X\) and \(Y\) are independent:

\[\begin{aligned} \Pr(X=x \mid Y=y) &=\frac{\Pr(X=x,Y=y)}{\sum_{x^{\prime}\in X}\Pr(X=x^{\prime},Y=y)}\\ &=\frac{\Pr(X=x)\Pr(Y=y)}{\sum_{x^{\prime}\in X}\Pr(X=x^{\prime})\Pr(Y=y)}\\ &=\frac{\Pr(X=x)\Pr(Y=y)}{\Pr(Y=y)}\\ &=\Pr(X=x) \end{aligned}\]

The Chain Rule

Our definitions of conditional and marginal probabilities give rise to an important identity known as the chain rule. The chain rule says that we can pick any ordering of our random variables and write their joint probability as a product of conditional probabilities, in the following way.

\(\Pr(W^{(1)}=w^{(1)}, \dots, W^{(n)}=w^{(n)})\) \(=\)

\[\begin{aligned} &&\Pr(W^{(n)}=w^{(n)}\mid &W^{(1)}=w^{(1)}, \dots, W^{(n-2)}=w^{(n-2)}, W^{(n-1)}=w^{(n-1)})\\ \cdot&&\Pr(W^{(n-1)}=w^{(n-1)} \mid &W^{(1)}=w^{(1)}, \dots, W^{(n-2)}=w^{(n-2)})\\ &&\vdots\\ \cdot&&\Pr(W^{(3)}=w^{(3)} \mid &W^{(1)}=w^{(1)}, W^{(2)}=w^{(2)})\\ \cdot&&\Pr(W^{(2)}=w^{(2)} \mid &W^{(1)}=w^{(1)})\\ \cdot&&\Pr(&W^{(1)}=w^{(1)}) \end{aligned}\]

Note this identity is completely general; it does not rely on any independence assumptions.

What does this mean? Let’s consider the joint distribution over strings of three words drawn from the vocabulary we have been using, \(V=\{\textit{Call}, \textit{me}, \textit{Ishmael}\}\).

\[\Pr(W^{(1)},W^{(2)},W^{(3)})\]

The chain rule tells us that if we pick an ordering of the random variables, say \(W^{(2)},W^{(1)},W^{(3)}\), we can rewrite each probability in this joint distribution as a product of conditional probabilities, each conditioning on the values of all random variables earlier in the ordering. Since there are \(3!=6\) possible orderings, if we want to know the probability of the sequence Call me Ishmael, we can compute that probability in six ways.

\(\Pr(W^{(1)}=\textit{Call}, W^{(2)}=\textit{me}, W^{(3)}=\textit{Ish})\) \(=\)

\[\begin{aligned} &\Pr(W^{(1)}=\textit{Call}) &\cdot&& \Pr(W^{(2)}=\textit{me} \mid W^{(1)}=\textit{Call}) &&\cdot&& \Pr(W^{(3)}=\textit{Ish} \mid W^{(1)}=\textit{Call}, W^{(2)}=\textit{me})&=\\ &\Pr(W^{(1)}=\textit{Call}) &\cdot&& \Pr(W^{(3)}=\textit{Ish} \mid W^{(1)}=\textit{Call}) &&\cdot&& \Pr(W^{(2)}=\textit{me} \mid W^{(1)}=\textit{Call}, W^{(3)}=\textit{Ish})&=\\ &\Pr(W^{(2)}=\textit{me}) &\cdot&& \Pr(W^{(1)}=\textit{Call} \mid W^{(2)}=\textit{me}) &&\cdot&& \Pr(W^{(3)}=\textit{Ish} \mid W^{(1)}=\textit{Call}, W^{(2)}=\textit{me})&=\\ &\Pr(W^{(2)}=\textit{me}) &\cdot&& \Pr(W^{(3)}=\textit{Ish} \mid W^{(2)}=\textit{me}) &&\cdot&& \Pr(W^{(1)}=\textit{Call} \mid W^{(2)}=\textit{me}, W^{(3)}=\textit{Ish})&=\\ &\Pr(W^{(3)}=\textit{Ish}) &\cdot&& \Pr(W^{(1)}=\textit{Call} \mid W^{(3)}=\textit{Ish}) &&\cdot&& \Pr(W^{(2)}=\textit{me} \mid W^{(1)}=\textit{Call}, W^{(3)}=\textit{Ish})&=\\ &\Pr(W^{(3)}=\textit{Ish}) &\cdot&& \Pr(W^{(2)}=\textit{me} \mid W^{(3)}=\textit{Ish}) &&\cdot&& \Pr(W^{(1)}=\textit{Call} \mid W^{(2)}=\textit{me}, W^{(3)}=\textit{Ish})&\\ \end{aligned}\]

All six of these computations will give us the same joint probability.
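To check this concretely, the Python sketch below recomputes the joint probability of the three-word string from each of the \(3!=6\) orderings, using only marginalization and the definition of conditional probability. The bag-of-words joint is used here just as a convenient concrete table (the identity holds for any joint distribution), and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

# A concrete joint table over three words (a bag-of-words joint, for convenience).
theta = {"Call": Fraction(1, 2), "me": Fraction(1, 4), "Ishmael": Fraction(1, 4)}
joint = {s: theta[s[0]] * theta[s[1]] * theta[s[2]] for s in product(theta, repeat=3)}

def marginal(joint, positions, values):
    """Pr(the variables at `positions` take on `values`), by marginalization."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in zip(positions, values)))

def chain_rule(joint, order, state):
    """Multiply conditional probabilities following `order`, an ordering of the positions."""
    prob = Fraction(1)
    for k in range(len(order)):
        seen, current = order[:k], order[:k + 1]
        # Pr(new variable | variables seen so far) = larger marginal / smaller marginal
        prob *= (marginal(joint, current, [state[i] for i in current])
                 / marginal(joint, seen, [state[i] for i in seen]))
    return prob

target = ("Call", "me", "Ishmael")
for order in permutations([0, 1, 2]):
    print(order, chain_rule(joint, order, target))  # the same value, 1/32, every time
```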

The chain rule is important because it lets us express complex joint distributions in many different ways as combinations of (potentially) simpler conditional distributions. If we have a joint distribution over two variables, say a corpus \(\mathbf{C}\) out of the space of possible corpora \(\mathcal{C}\) and some model parameters \(\theta\) out of a space of possible model parameters \(\Theta\), we can use the chain rule to write the distribution like so:

\[\Pr(\mathcal{C}=\mathbf{C},\Theta=\theta)=\Pr(\mathcal{C}=\mathbf{C} \mid \Theta=\theta)\Pr(\Theta=\theta)\]

or, using the slightly less precise notation,

\[\Pr(\mathbf{C}, \theta)=\Pr(\mathbf{C} \mid \theta)\Pr(\theta)\]

Note that the first term here, \(\Pr(\mathbf{C} \mid \theta)\), is just the probability of the corpus given the parameters, which we saw above. But what is the second term, \(\Pr(\theta)\)?

