The usefulness of hierarchical probabilistic models lies in the way they can be combined with the operation of probabilistic conditioning to derive another approach to model comparison, selection, and parameter estimation known as Bayesian inference.

The basic idea of Bayesian inference is that we treat the problem of fitting a model as one of computing a conditional distribution over the models in our class given some input data. We have seen how models in the set of categorical distributions can be indexed by particular vectors \(\vec{\theta} \in \Theta\). In maximum likelihood estimation, we thought of \(\vec{\theta}\) strictly as a parameter: a fixed, given input to the model. However, now that we have introduced a distribution on \(\vec{\theta}\), we instead think of \(\vec{\theta}\) as a random variable itself, and consider the conditional distribution \(\Pr(\Theta=\vec{\theta} \mid \mathbf{C}=C, \vec{\alpha})\). Since we have defined a joint distribution \(\Pr(\Theta=\vec{\theta}, \mathbf{C}=C \mid \vec{\alpha})\), we know that we can compute this conditional distribution by our two-step procedure. What does this look like?

Bayes’ Rule

Combining the definition of conditional probability with the chain rule, we end up with an important special-case law of probability known as Bayes’ Rule:

\[\Pr(H=h \mid D=d) =\frac{\Pr(D=d \mid H=h)\Pr(H=h)}{\sum_{h'\in H}\Pr(D=d \mid H=h')\Pr(H=h')} =\frac{\Pr(D=d \mid H=h)\Pr(H=h)}{\Pr(D=d)}\]

Note that this is just the definition of conditional probability with \(\Pr(D=d \mid H=h)\Pr(H=h)\) substituted for \(\Pr(D=d, H=h)\) via the chain rule. Bayes’ rule is often written in this form, with \(H\) standing for a random variable representing some hypothesis, for instance a member of the set of possible categorical distributions \(\Theta\), and \(D\) standing for some data, like a corpus of utterances \(\mathbf{C}\).

The term \(\Pr(H=h \mid D=d)\) is known as the posterior probability of the hypothesis given the data. The term \(\Pr(H=h)\) is known as the prior probability of the hypothesis. The term \(\Pr(D=d \mid H=h)\) is known as the likelihood of the hypothesis (or less correctly, but more often, likelihood of the data), and the term \(\sum_{h'\in H} \Pr(D=d \mid H=h')\Pr(H=h')\) is known as the evidence or, more often, marginal likelihood of the data. Note that \(\sum_{h'\in H} \Pr(D=d \mid H=h')\Pr(H=h')=\Pr(D=d)\), so this denominator is just the marginal probability of the data (marginalizing over all hypotheses).
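To make these quantities concrete, here is a minimal sketch of Bayes’ rule over a discrete hypothesis space, using a made-up two-hypothesis coin-flip example (the hypotheses, prior, and data below are all hypothetical, chosen only for illustration):

```python
import numpy as np

# Two hypothetical categorical models of coin flips: Pr(heads | h).
hypotheses = {"fair": 0.5, "biased": 0.8}
# Made-up prior probabilities Pr(H = h) over the two hypotheses.
prior = {"fair": 0.7, "biased": 0.3}

data = ["H", "H", "T", "H"]  # an observed sequence of flips

def likelihood(h, flips):
    """Pr(D = d | H = h) for independent flips under hypothesis h."""
    p = hypotheses[h]
    return np.prod([p if f == "H" else 1 - p for f in flips])

# Numerator of Bayes' rule for each hypothesis: likelihood times prior.
unnormalized = {h: likelihood(h, data) * prior[h] for h in hypotheses}

# Denominator: the evidence / marginal likelihood Pr(D = d).
evidence = sum(unnormalized.values())

# Posterior Pr(H = h | D = d).
posterior = {h: v / evidence for h, v in unnormalized.items()}
print(posterior)
```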

We can use Bayes’ Rule to write our current model like this:

\[\Pr(\Theta=\vec{\theta} \mid \mathbf{C}=C,\vec{\alpha}) =\frac{\Pr(\mathbf{C}=C \mid \Theta=\vec{\theta}, \vec{\alpha})\Pr(\Theta=\vec{\theta}\mid\vec{\alpha})}{\int_{\Theta} \Pr(\mathbf{C}=C \mid \Theta=\vec{\theta},\vec{\alpha})\Pr(\Theta=\vec{\theta}\mid \vec{\alpha})\,d\vec{\theta}}\]

Note two things. First, we have replaced the sum in the denominator with an integral, since \(\vec{\theta}\) is continuous. Second, notice the way that, in Bayes’ rule, we replace the joint probability \(\Pr(\Theta=\vec{\theta}, \mathbf{C}=C \mid \vec{\alpha})\) in the numerator of the definition of conditional probability with the chain-rule decomposition \(\Pr(\mathbf{C}=C \mid \Theta=\vec{\theta}, \vec{\alpha})\Pr(\Theta=\vec{\theta} \mid \vec{\alpha})\). This is exactly the form of our hierarchical generative model. In other words, both the numerator and denominator in Bayes’ rule are defined using our hierarchical generative model.

There are many different ways to interpret Bayes’ rule. It represents a way of mapping from prior beliefs to posterior beliefs about a domain. This mapping can be thought of either as a reweighting of the prior beliefs by the likelihood, or as a reweighting of the likelihood of each model by how a priori plausible that model was.

Bayesian Inference for the BOWs Model

Recall that our likelihood was given by the following:

\[\Pr(\mathbf{C} \mid \vec{\theta}) = \prod_{w \in V} \theta_{w}^{n_{w}}\]

And a Dirichlet prior over \(\vec{\theta}\) with pseudocounts \(\vec{\alpha}\) is given by the following:

\[\Pr(\vec{\theta} \mid \vec{\alpha})={\frac {\Gamma \left(\sum_{w \in V}\alpha _{w}\right)}{\prod _{w \in V} \Gamma (\alpha _{w})}} \prod_{w \in V}\theta_{w}^{\alpha _{w}-1}\]

So the joint probability of our model is given by the following expression.

\[\Pr(\mathbf{C}, \vec{\theta} \mid \vec{\alpha}) = \Pr(\mathbf{C} \mid \vec{\theta})\Pr(\vec{\theta} \mid \vec{\alpha}) = {\frac {\Gamma \left(\sum_{w \in V}\alpha _{w}\right)}{\prod _{w \in V} \Gamma (\alpha _{w})}} \prod_{w \in V} \theta_{w}^{[n_{w}+\alpha _{w}]-1}\]
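Since both the likelihood and the prior are products of powers of \(\theta_{w}\), the exponents simply add. As a numerical sanity check, here is a minimal sketch that evaluates the log of this joint probability with NumPy and SciPy, working in log space to avoid overflow in the Gamma functions (the counts and pseudocounts are hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def log_joint(counts, alpha, theta):
    """log Pr(C, theta | alpha) for the Dirichlet-categorical model.

    counts: empirical counts n_w; alpha: pseudocounts; theta: a probability
    vector over the vocabulary. All three are arrays of length |V|.
    """
    counts, alpha, theta = map(np.asarray, (counts, alpha, theta))
    # log of the Dirichlet normalizing constant Gamma(sum a) / prod Gamma(a)
    log_norm = gammaln(alpha.sum()) - gammaln(alpha).sum()
    # log of prod_w theta_w^{n_w + a_w - 1}
    return log_norm + np.sum((counts + alpha - 1) * np.log(theta))

# Hypothetical counts and pseudocounts over a 3-word vocabulary.
print(log_joint(counts=[5, 2, 1], alpha=[1.0, 1.0, 1.0], theta=[0.5, 0.3, 0.2]))
```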

Plugging this into Bayes’ Rule we get:

\[\Pr(\vec{\theta} \mid \mathbf{C}, \vec{\alpha}) = \frac{ {\frac {\Gamma \left(\sum_{w \in V}\alpha _{w}\right)}{\prod _{w \in V} \Gamma (\alpha _{w})}} \prod_{w \in V} \theta_{w}^{[n_{w}+\alpha _{w}]-1}}{\int_{\Theta} \left[ {\frac {\Gamma \left(\sum_{w \in V}\alpha _{w}\right)}{\prod _{w \in V} \Gamma (\alpha _{w})}} \prod_{w \in V} \theta_{w}^{[n_{w}+\alpha _{w}]-1} \right]\,d\vec{\theta}}\]

Replacing the denominator with the expression we derived at the end of the Hierarchical Models module for the marginal likelihood of the corpus, we get:

\[\Pr(\vec{\theta} \mid \mathbf{C}, \vec{\alpha}) = \frac{ {\frac {\Gamma \left(\sum_{w \in V}\alpha _{w}\right)}{\prod _{w \in V} \Gamma (\alpha _{w})}} \prod_{w \in V} \theta_{w}^{[n_{w}+\alpha _{w}]-1}}{\frac{\Gamma(\sum_{w \in V} \alpha_{w})}{\prod_{w \in V} \Gamma(\alpha _{w})} \frac{\prod _{w \in V} \Gamma ([n_{w}+\alpha_{w}])}{\Gamma \left(\sum_{w \in V}[n_{w}+\alpha_{w}]\right)}}\]

Simplifying (the prior’s normalizing constant \({\frac{\Gamma\left(\sum_{w \in V}\alpha_{w}\right)}{\prod_{w \in V}\Gamma(\alpha_{w})}}\) cancels between numerator and denominator), we derive the posterior probability of \(\vec{\theta}\).

\[\Pr(\vec{\theta}\mid \mathbf{C}, \vec{\alpha})= {\frac {\Gamma \left(\sum_{w \in V}[n_{w}+\alpha _{w}]\right)}{\prod _{w \in V} \Gamma ([n_{w}+\alpha _{w}])}}\prod_{w \in V}\theta_{w}^{[n_{w}+\alpha _{w}]-1}\]

Notice that this posterior distribution over \(\vec{\theta}\) has a very recognizable form: it is another Dirichlet distribution except that we have incremented the pseudocounts \(\vec{\alpha}\) with our empirical counts \(\vec{n}\)!
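This conjugacy makes the posterior update trivial to compute: we just add the empirical counts to the pseudocounts. A minimal sketch, with a hypothetical 3-word vocabulary and made-up counts:

```python
import numpy as np

# The conjugacy result in code: the posterior over theta is a Dirichlet
# whose parameters are the prior pseudocounts plus the empirical counts.
alpha = np.array([1.0, 1.0, 1.0])   # prior pseudocounts (hypothetical)
counts = np.array([5, 2, 1])        # empirical counts n_w (hypothetical)

posterior_alpha = alpha + counts    # posterior is Dirichlet(alpha + n)

# Compare prior and posterior means, and draw a sample from the posterior.
rng = np.random.default_rng(0)
print("prior mean:    ", alpha / alpha.sum())
print("posterior mean:", posterior_alpha / posterior_alpha.sum())
print("posterior draw:", rng.dirichlet(posterior_alpha))
```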

Using the Bayesian Posterior

Note that unlike maximum likelihood modeling, Bayesian inference gives us a probability distribution over our models (indexed by \(\vec{\theta}\)) rather than simply a point estimate of a single model (\(\hat{\theta}\)). If we want to apply our Bayesian posterior distribution to make predictions about new datapoints, we have several options.

  1. We can make predictions based on the value of \(\vec{\theta}\) with the highest posterior probability. This is known as the Maximum A Posteriori (MAP) estimate.
  2. We can sample a value of \(\vec{\theta}\) from the posterior and use the sampled value.
  3. We can make our prediction with every value of \(\vec{\theta}\) and base our ultimate prediction on a posterior-weighted average of these particular predictions. This is known as using the posterior predictive distribution. All three options are illustrated in the sketch below.
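Here is a minimal sketch of the three options for a Dirichlet posterior. The posterior pseudocounts are hypothetical, and the MAP formula assumes every posterior pseudocount exceeds 1:

```python
import numpy as np

rng = np.random.default_rng(0)
posterior_alpha = np.array([6.0, 3.0, 2.0])  # hypothetical Dirichlet(alpha + n)

# Option 1: MAP estimate. For a Dirichlet with all parameters > 1, the mode
# is (alpha_w - 1) / (sum(alpha) - |V|).
theta_map = (posterior_alpha - 1) / (posterior_alpha.sum() - len(posterior_alpha))

# Option 2: a single sample from the posterior.
theta_sample = rng.dirichlet(posterior_alpha)

# Option 3: the posterior predictive, i.e. the posterior-weighted average
# prediction. For the Dirichlet this average has a closed form: the
# normalized pseudocounts (the posterior mean), as derived below.
theta_predictive = posterior_alpha / posterior_alpha.sum()

print(theta_map, theta_sample, theta_predictive, sep="\n")
```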

Posterior Predictive Distributions

The idea behind the Bayesian posterior predictive distribution is that we might not want to commit to a single hypothesis in our posterior distribution, but instead might want to average over all of our posterior values.

Imagine that our corpus consisted of \(N\) words, and that we computed a posterior distribution \(\Pr(\vec{\theta} \mid w^{(1)},...,w^{(N)},\vec{\alpha})\). Based on this posterior distribution, what is the probability that the next word, \(W^{(N+1)}\), takes on value \(w_i\) (e.g., what is the probability that the next word is Ishmael)? In other words, what is the following probability?

\[\Pr(W^{(N+1)}=w_i \mid w^{(1)},...,w^{(N)},\vec{\alpha})\]

Note that this is, once again, a marginal probability, marginalizing over \(\vec{\theta}\):

\[\Pr(W^{(N+1)}=w_i \mid w^{(1)},...,w^{(N)},\vec{\alpha}) = \int_{\Theta} \Pr(W^{(N+1)}=w_i \mid \vec{\theta}) \Pr(\vec{\theta} \mid w^{(1)},...,w^{(N)},\vec{\alpha})\,d\vec{\theta}\]

Notice that the first term in this integral is just \(\theta_{w_i}\), the probability of \(w_i\). The second term is just the posterior probability of \(\vec{\theta}\), which takes the form of a Dirichlet distribution, as we derived above.

\[\Pr(W^{(N+1)}=w_i \mid w^{(1)},...,w^{(N)},\vec{\alpha}) = \int_{\Theta} \theta_{w_i} {\frac {\Gamma \left(\sum_{w \in V}[n_{w}+\alpha _{w}]\right)}{\prod _{w \in V} \Gamma ([n_{w}+\alpha _{w}])}}\prod_{w \in V}\theta_{w}^{[n_{w}+\alpha _{w}]-1}\,d\vec{\theta}\]

Multiplying \(\theta_{w_i}\) into the product increments the exponent of \(\theta_{w_i}\) by one, so the integrand is an unnormalized Dirichlet density and the integral evaluates to a ratio of Dirichlet normalizing constants. Using just algebra and the fact that \(\Gamma(m+1) = m\Gamma(m)\), it is possible to show that:

\[\Pr(W^{(N+1)}=w_i \mid w^{(1)},...,w^{(N)},\vec{\alpha}) = \frac{n_{w_i}+\alpha _{w_i}}{\sum_{w^\prime \in V}[n_{w^\prime}+\alpha _{w^\prime}]}\]

In other words, the probability that the next word is \(w_i\) is just the pseudocount of \(w_i\) plus the number of times that you have seen \(w_i\) before, renormalized. This is exactly equivalent to additive smoothing with an additive constant of \(\alpha_{w_i}\) for each word \(w_i\).
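We can check this closed form numerically by approximating the integral with Monte Carlo samples from the posterior. A minimal sketch, again with hypothetical counts and pseudocounts:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])   # hypothetical pseudocounts
counts = np.array([5, 2, 1])        # hypothetical corpus counts
post = alpha + counts               # posterior pseudocounts

# Closed form: Pr(next word = w_i) = (n_i + alpha_i) / sum_j (n_j + alpha_j).
closed_form = post / post.sum()

# Monte Carlo check of the integral: average theta_{w_i} over posterior draws.
thetas = rng.dirichlet(post, size=100_000)
monte_carlo = thetas.mean(axis=0)

print(closed_form, monte_carlo)     # these should agree closely
```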

Sequential Sampling: The Pólya Urn Scheme

Suppose that I give you a vocabulary and a set of pseudocounts for that vocabulary, and you follow this sampling scheme (implemented in the sketch at the end of this section).

  1. Take the set of pseudocounts \(\vec{\alpha}\) and normalize it.
  2. Sample a word from the resulting categorical distribution.
  3. If you sampled word \(w_i\), then add \(1\) to pseudocount \(\alpha_i\).
  4. Return to step 1 with your updated pseudocounts. Repeat \(N\) times.

This will give you a sequence of \(N\) random variables \(W^{(1)},...,W^{(N)}\).

Compare this to the following sampling scheme.

  1. Sample a parameter vector \(\vec{\theta} \sim \operatorname{Dirichlet}(\vec{\alpha})\)
  2. Sample \(N\) words independently from \(\operatorname{Categorical}(\vec{\theta})\).

This will also give you back a sequence of words \(W^{(1)},...,W^{(N)}\).

These two sequences of random variables have exactly the same distribution. What you are doing in the first version is updating your posterior predictive after each sample and then using that to sample the next observation. In other words, you are sequentially conditioning on your own sequence of generated observations.

This sequential sampling process integrates out \(\vec{\theta}\) incrementally, and is known as the Pólya urn scheme representation of the Dirichlet-categorical distribution.
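The equivalence can be checked empirically. Here is a minimal sketch of both schemes in NumPy, with hypothetical pseudocounts, comparing the joint distribution of the first two sampled words under each:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])  # hypothetical pseudocounts
trials = 100_000

def polya_urn(alpha, N):
    """Sample N words by repeatedly normalizing and updating pseudocounts."""
    a = alpha.copy()
    words = []
    for _ in range(N):
        w = int(rng.choice(len(a), p=a / a.sum()))  # steps 1-2: normalize, sample
        a[w] += 1                                   # step 3: increment pseudocount
        words.append(w)
    return tuple(words)

def dirichlet_categorical(alpha, N):
    """Sample theta once, then draw N words i.i.d. from Categorical(theta)."""
    theta = rng.dirichlet(alpha)
    return tuple(int(w) for w in rng.choice(len(alpha), size=N, p=theta))

# Empirical joint frequencies of the first two words under each scheme.
urn = [polya_urn(alpha, 2) for _ in range(trials)]
dc = [dirichlet_categorical(alpha, 2) for _ in range(trials)]
for pair in [(0, 0), (0, 1), (2, 2)]:
    print(pair, urn.count(pair) / trials, dc.count(pair) / trials)
```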

