A fundamental concept in probability theory is the expected value or expectation of a random variable. Expectations are just a generalization of the idea of a mean or average to arbitrary probability distributions.

Suppose I have a set of numbers, \(N\). How do I compute the mean or average of this set of numbers?

\[\sum_{n \in N} \frac{1}{|N|}n\]

I can also compute the mean or average of some function \(f\) applied to this set of numbers.

\[\sum_{n \in N} \frac{1}{|N|}f(n)\]
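
To make this concrete, here is a minimal Python sketch of both computations; the particular set and function below are invented for illustration.

```python
# A hypothetical set of numbers and a function of them.
N = {2, 4, 6, 8}

def f(n):
    return n ** 2

# Mean of the set: each element weighted equally by 1/|N|.
mean_N = sum(n / len(N) for n in N)       # (2 + 4 + 6 + 8) / 4 = 5.0

# Mean of f applied to the set, with the same uniform weights.
mean_fN = sum(f(n) / len(N) for n in N)   # (4 + 16 + 36 + 64) / 4 = 30.0

print(mean_N, mean_fN)
```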

The mean assumes that each value is equally weighted (hence the \(\frac{1}{|N|}\) term). Instead of \(N\) being a set, let’s assume that \(N\) is a random variable with an associated probability mass function \(p\), and weight each value \(n\) that \(N\) can take by \(p(n)\).

\[\sum_{n \in N} p(n)f(n)\]

This quantity is known as the expected value or expectation of the function \(f\) with respect to the distribution \(p\). We often write this with the following notation.

\[\mathbb{E}_{N \sim p} [f(N)] = \sum_{n \in N} p(n)f(n)\]

When it is clear which distribution we are taking expectations with respect to, we often drop the subscript and write the following.

\[\mathbb{E} [f(N)] = \sum_{n \in N} p(n)f(n)\]
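
Here is a small Python sketch of such an expectation; the probability mass function below (an invented, biased four-sided die) and the function \(f\) are hypothetical examples.

```python
# Hypothetical pmf over the values a random variable N can take.
p = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

def f(n):
    return n ** 2

# E[f(N)] = sum over n of p(n) * f(n)
expectation = sum(p_n * f(n) for n, p_n in p.items())
print(expectation)  # 0.1*1 + 0.2*4 + 0.3*9 + 0.4*16 = 10.0
```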

We sometimes make use of a special real-valued function called the Dirac \(\delta\) function. The function \(\delta_{x}(\cdot)\) returns \(1\) when its argument is equal to \(x\) and \(0\) otherwise. It is also sometimes written as \(1_{x}(\cdot)\) and called an indicator function.

Note that the expected value of the Dirac \(\delta\) function for a value \(x\) is the probability of that value.

\[\mathbb{E} [\delta_{x}(N)] = \sum_{n \in N} p(n)\delta_{x}(n) = p(x)\]

This shows that we can think of the sums or integrals used in marginalization as expectations of the Dirac \(\delta\) function.
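
This identity is easy to check numerically. In the sketch below, the pmf is again an invented example.

```python
# Hypothetical pmf, as above.
p = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

def delta(x):
    """Return the indicator function delta_x(.)."""
    return lambda n: 1.0 if n == x else 0.0

# E[delta_x(N)] recovers p(x) for each x in the support.
for x in p:
    expectation = sum(p_n * delta(x)(n) for n, p_n in p.items())
    assert abs(expectation - p[x]) < 1e-12
    print(x, expectation)
```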

An expected value is a linear (in fact, convex) combination of the values of some function over the sample space of a random variable. Expectation is therefore linear, meaning that the following properties hold in general.

\[\mathbb{E} [X + Y] = \mathbb{E} [X] + \mathbb{E} [Y]\]

and

\[\mathbb{E} [aX] = a\mathbb{E} [X]\]

for arbitrary constant \(a\).
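
Both properties are easy to verify numerically. The sketch below uses an invented joint pmf over pairs \((x, y)\); note that linearity of expectation does not require \(X\) and \(Y\) to be independent.

```python
# Hypothetical joint pmf over pairs (x, y); X and Y are deliberately dependent.
joint = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}

E_X = sum(q * x for (x, y), q in joint.items())            # 0.5
E_Y = sum(q * y for (x, y), q in joint.items())            # 0.6
E_X_plus_Y = sum(q * (x + y) for (x, y), q in joint.items())
assert abs(E_X_plus_Y - (E_X + E_Y)) < 1e-12               # E[X + Y] = E[X] + E[Y]

a = 3.0
E_aX = sum(q * a * x for (x, y), q in joint.items())
assert abs(E_aX - a * E_X) < 1e-12                         # E[aX] = a E[X]
```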

An important inequality involving expectations that comes up often in probabilistic modeling is Jensen’s inequality, which says that for any concave function \(\varphi\)

\[\varphi \left(\mathbb{E} [X]\right) \geq \mathbb{E} \left[\varphi (X)\right]\]
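
Here is a quick numerical check under an assumed pmf, using the concave function \(\varphi = \log\).

```python
import math

# Hypothetical pmf over positive values (log requires positive arguments).
p = {1: 0.25, 2: 0.25, 4: 0.5}

E_X = sum(q * x for x, q in p.items())                # 2.75
E_log_X = sum(q * math.log(x) for x, q in p.items())  # ~0.866

# Jensen: for concave phi, phi(E[X]) >= E[phi(X)].
assert math.log(E_X) >= E_log_X
print(math.log(E_X), E_log_X)
```

For convex \(\varphi\) the inequality is reversed.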
