The principle of maximum likelihood is just the first of several principles of inductive inference we will see in this course. In general, a principle of inductive inference gives a way of choosing between different hypotheses or models in light of some data. It will be useful to formalize this idea a little for comparison purposes later in the course. An inductive inference problem is a triple \(\langle H, D, Q\rangle\) consisting of a hypothesis space \(H\), a dataset \(D\), and a quality function \(Q\). Typically, our hypotheses are unobserved or latent. In the case of the bag-of-words model, our hypothesis space was the simplex, and the individual hypotheses were points in this space of probability vectors; that is, distributions over words. Our dataset was our corpus.

The quality function returns a numerical measure of the quality of each hypothesis given the dataset. In the case of the principle of maximum likelihood, we based this quantity on the likelihood of the dataset, preferring the maximum likelihood solution to all others. One way to capture this is to define a function which assigns \(0\) to every hypothesis besides the argmax. The \(\delta[\cdot]\) function is a higher-order function that takes in a predicate and returns \(1\) when the predicate is true and \(0\) otherwise. Using this function, we can define the quality function for the principle of maximum likelihood as follows

\[\DeclareMathOperator*{\argmax}{\arg\max} \begin{aligned} Q^{\mathrm{ML}}(\vec{\theta}, \mathbf{C}) &=\delta\Big[ \vec{\theta}=\argmax_{\vec{\theta}{}' \in \Theta } \mathcal{L}(\vec{\theta}{}';\mathbf{C}) \Big]\\ &=\delta\Big[ \vec{\theta}=\argmax_{\vec{\theta}{}' \in \Theta } \sum_{w \in V} n_{w}\log\theta'_{w} \Big]\\ &=\delta\Big[ \vec{\theta}=\vec{\hat{\theta}}{}^{\mathrm{ML}} \Big] \end{aligned}\]

In other words, \(Q^{\mathrm{ML}}\) is the function which returns \(1\) for the maximum likelihood parameter vector \(\vec{\hat{\theta}}{}^{\mathrm{ML}}\) and \(0\) for every other hypothesis. Thus, we can state the inductive inference problem solved by maximum likelihood as the triple \(\langle \Theta, \mathbf{C}, Q^{\mathrm{ML}}\rangle\).
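To make the triple concrete, here is a minimal sketch in Python. It assumes the closed-form maximum likelihood solution for the bag-of-words model, \(\hat{\theta}_{w} = n_{w}/N\) (the relative frequency of each word), and implements \(Q^{\mathrm{ML}}\) directly as the \(\delta\) function over hypotheses; the function names are illustrative, not part of any standard library.

```python
from collections import Counter

def mle_theta(corpus):
    """Closed-form ML estimate for a bag-of-words model: theta_w = n_w / N,
    the relative frequency of each word type in the corpus."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def q_ml(theta, corpus):
    """Q^ML(theta, C): the delta function that returns 1 if theta is the
    maximum likelihood parameter vector for the corpus, and 0 otherwise."""
    return 1 if theta == mle_theta(corpus) else 0

# A toy corpus C; the hypothesis space Theta is the simplex over its vocabulary.
corpus = ["the", "cat", "sat", "on", "the", "mat"]
theta_hat = mle_theta(corpus)   # the unique hypothesis with Q^ML = 1
```

Here `theta_hat["the"]` is \(2/6\), and `q_ml(theta_hat, corpus)` returns \(1\), while any other point on the simplex receives quality \(0\).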

