In the last unit, we saw the notion of a formal language and how it could be used to formalize the idea of grammaticality. In this unit, we will discuss how to use the notion of a class of formal languages to capture the idea of a model for natural language.

Classes of Formal Languages

Grammaticality is defined with respect to specific languages, like English or French, because the set of well-formed sentences is different between each pair of human languages. As a result, our model of grammatical and ungrammatical strings—a formal language—is a model of individual human languages. Our ultimate goal, however, is to a achieve a computational model of all possible human languages or, in linguistic terms, universal grammar. In other words, we want to define a class of computational systems which can represent all possible human languages and no languages which are not possible human languages. To model this notion, we will need one more concept: a class of formal languages. A class of formal languages \(\mathcal{C}\) is a subset of \(2^{V^{*}}\), that is, a set of formal languages.

Examples of Classes of Languages

What are some examples of classes of formal languages? A class of formal languages can be defined in many ways. One way to define a class of languages is to specify a property that every language in the class must possess. An example of this approach is the class \(\mathcal{FIN}\) of all finite subsets of \(V^*\). Another way to define a class of formal languages is by reference to a machine model that is able to define particular languages. We pick a machine model, say \(\mathbf{M}\) and say that the class \(\mathcal{M}\) is any language definable by that machine model. So for instance, the class \(\mathcal{TM}\) is the set of formal languages which can be defined by a Turing machine. Or we might choose the class of finite-state automata and define the class of so-called regular languages \(\mathcal{REG}\) that are defined by these machines. In this course, we will see many examples of classes of languages defined this way.

Closure Properties of Classes of Formal Languages

An important concept in formal language theory is closure of a class of languages under a set-theoretic operation. Closure refers to the idea that applying an operation to languages in the class always results in more languages in the class. More precisely, suppose that there is a unary operation \(o\), such that if any language \(L\) is in the class, then \(L^\prime=o(L)\) is also in the class. Or, for a binary operation \(o\), if any languages \(L_1\) and \(L_2\) are in the class, then so is \(L^\prime=o(L_1,L_2)\). When a property like the above holds, we say the class is closed under the operation \(o\). As we will see shortly, closure is an important tool for proving that certain languages cannot be contained in a particular class.

A Model of Well-Formed Sentences

How can we use the ideas we just defined?

We have already discussed how we will use the notion of a formal language to formalize the idea of grammaticality: All the well formed sentences are within some formal language \(L\) and all other possible strings \(\mathord{\sim} L\) are ill-formed. Now, what is a good model of human language and universal grammar? As we already discussed, for this we will need a class of languages that is meant to capture any possible human language.

As a first idea, let’s consider the possibility that the class of human languages is actually just \(\mathcal{FIN}\), that is the set of all possible finite languages (we will set aside the issue of the vocabulary for now). This means that any particular language is just a (potentially very large) finite list of possible sentences. For example, if our lexicon is \(V = \{a,b\}\), the language \(L=\{a,aba,bbb,ab\}\) is in \(\mathcal{FIN}\), but \((ab)^*\) is not. How will this work out as a model of well-formed sentences?

Let’s consider a very restricted subset of the English language. We will start with a single sentence:

Alice talked to the father of Bob.

This is clearly a grammatical sentence in the English language, and we can use it as the basis for a sequence of longer and longer sentences:

Alice talked to the father of the mother of Bob.

Alice talked to the father of the mother of the father of Bob.

Alice talked to the father of the father of the mother of Bob.

Alice talked to the father of the mother of the father of the mother of Bob.

Alice talked to the father of the father of the mother of the mother of Bob.

These sentences were constructed by adding the prepositional phrases of the mother and of the father to the base sentence. It is clear that each of these new sentences is grammatical, and there is no reason to stop after only four iterations of this process: In principle we can add these prepositional phrases as many times as we want, and the resulting sentence will be grammatical. This process is unbounded. Of course, eventually sentences will get too long to be of practical for normal use. But, there is no specific number of prepositional phrases below which the sentences of are grammatical and above which the sentences are not.

In order to characterize this subset of English, it is useful to simplify the strings we are talking about. Let’s define a language homomorphism \(h\) from strings of English words to strings over the vocabulary \(\{a,b\}^*\) (i.e., \(E^* \to \{\epsilon,a,b \}^*\)). Recall that every complete specification of behavior on the vocabulary of some language can be used to define a unique homomorphism. Thus, let’s define \(h\) by saying which string in \(\{a,b\}^*\) each word of English \(E\) maps to.

Every time that we see the word father, we will replace it with the character \(a\), and every time we see the word mother, we will replace it with the character \(b\). Every other word will get mapped to the null string \(\epsilon\).

\[\begin{eqnarray} h(father) & = &a\\ h(mother) & = & b\\ h(x) & = & \epsilon \ \forall\ x \notin \{father,mother\} \end{eqnarray}\]

The first sentence, containing only one instance of father, will then be transformed to the string \(a\). The second sentence, containing father and then mother, will be transformed to \(ab\). Continuing in this manner, we see that this fragment of English will get mapped to the language

\[L = \{\epsilon,a,b,ab,ba,aaa,bbb,baa,aba,baa,aab,abab,...\}\]

This language \(L\) captures some of the structure of the sentences above. Notice that other sentences using the relevant words, like the mother saw the father, are also mapped into strings in this language, even when they aren’t used in prepositional phrases. Thus our homomorphism is defined for every sentence of English. Also note that \(L\) is just the set of all possible strings or any length over the vocabulary \(\{a,b\}\), that is \(\{a,b\}^*\).

Now, it turns out that the class of \(\mathcal{FIN}\) is closed under homomorphism. This simply means that if we start from a language in \(\mathcal{FIN}\) and apply any homomorphism, the resulting language is also \(\mathcal{FIN}\). Thus, if we apply a homomorphism and do leave \(\mathcal{FIN}\), we must not have started there. Therefore, English cannot be in \(\mathcal{FIN}\)—no finite list will contain all of its grammatical sentences. And thus, \(\mathcal{FIN}\) cannot be a model of universal grammar.

It is worth noting a few things. First, we are not arguing that every natural language must be infinite; we may find one that is finite. Nevertheless, this does show that \(\mathcal{FIN}\) cannot be the correct model for all languages since at least one is not in it. Second, our use of infinite languages is an idealization. Of course, no amount of English that we sample in the wild will be infinite, we will always observe just a finite subset of the language. The point is that there is no particular bound on the length of sentences which makes sense to specify a priori. The infinite language captures our intuition that the process of sentence formation in English is unbounded the number of times you are allowed to do something is not specified in advance, not that there exists an actual infinite set of English sentences somewhere in the universe.


5 Formal Languages 7 Infinite Languages and Recursion