We have seen how to define the class of formal languages known as the strictly local languages and how to make these languages into probabilistic $n$-gram models. Now we turn back to the question that has been motivating this course: Are these models sufficient to capture the distinction between grammatical and ungrammatical sentences in natural languages.

We saw in Classes of Formal Languages that English could not be modeled using a finite formal language because of recursion and we saw in Dependencies between Words that bag-of-words models can’t capture English grammatical structure because they always generate every possible string in $V^*$. Strictly local grammars solve both problems. They can generate infinite languages and they do not always generate every possible string. Are they adequate models for natural language syntax?

To answer this question, let’s consider the logic that allowed us to rule out the finite languages and the bag-of-words languages as models of natural language syntax. In each case, we found a single property possessed by the languages in the class that were not possessed by English. These were finiteness and permutation-completeness. These properties allowed us to demonstrate that natural language is not in these classes of languages

How can we find a property which will allow us to determine whether or not natural languages can be described by strictly local grammars and $n$-gram models. In this section, we introduce the idea of an abstract characterization of a class of formal languages. An abstract characterization is a property of a formal language, stated only in terms of the strings of the language, that is a necessary and sufficient condition on the language being in the target class. That is, it is a property that every language in the class possesses, and which, if possessed by a language, means that language must be in the class. For the strictly local languages the abstract characterization we will examine is known as Suffix Substitution Closure (SSC).

$n$-Local Suffix Substitution Closure ($n$-SSC)

A language $L$ is a strictly $n$-local language if and only if for every \(u_1, u_2, v_1, v_2 \in V^*\) and for every $x \in V^{n-1}$ if $u_1 x v_1$ is in $L$ and $u_2 x v_2$ is in $L$, then so is $u_1 x v_2$. Or in more mathematical notation:

\[L \in \mathrm{SL}_n \iff \forall u_1, u_2, v_1, v_2 \in V^*, \forall x \in V^{n-1}\ [u_1 x v_1 \in L \wedge u_2 x v_2 \in L \implies u_1 x v_2 \in L]\]

What does this mean? A key intuition here has to do with the amount of string memory that an $n$-gram model has as it generates a string. A strictly $n$-local grammar generates each symbol based on the preceding $n-1$ symbols. Thus, it can only ever “look back” $n-1$ symbols and future choices of symbols cannot depend on symbols chosen before that window.

$n$-local Suffix Substitution Closure (SSC) starts with two strings from the language the form shown in the table below, where $x$ is length $n-1$. Let’s call the string $x$ the pivot. It is the shared substring between strings $u_1xv_1$ and $u_2xv_2 $.

Beginning Pivot Rest
$u_1$ $x$ $v_1$
$u_2$ $x$ $v_2$

It claims that if those two strings are in the language, so is the one in the table below.

Beginning 1 Pivot Rest 2
$u_1$ $x$ $v_2$

The idea is that the grammar can generate $x$ after both $u_1$ and $u_2$, and it can generate both $v_1$ and $v_2$ after $x$. But since the grammar can only see $n-1$ symbols back, it can’t tell if $x$ appears after $u_1$ or $u_2$, and then must also generate the string $u_1xv_2$.

$n$-SSC says if you have a strictly $n$-local language $L$, then whenever you can find a pivot $x$ of the right length (i.e., $n-1$) that lines up two strings, then the strings to the right and left of the pivot can be swapped and the resulting string is in $L$.

It also says that if any set of strings that has the property that all such $(n-1)$-pivot-based crosses are in the set (i.e., the set is closed under such swaps), this set is strictly $n$-local, that is, it can be generated by an $n$-gram model.

Proof that $n$-SSC characterizes $n$-local languages

To prove that $n$-SSC is a characterization of $n$-local languages, we need to prove both directions: First we prove that any any $n$-local language satisfies $n$-SSC

\[\tag{1}\label{forward} L \in \mathrm{SL}_n \implies n\!\operatorname{-SSC}(L)\]

then we prove that any set satisfying $n$-SSC is $n$-local

\[\tag{2}\label{backward} n\!\operatorname{-SSC}(L) \implies L \in \mathrm{SL}_n\]


Proof of \eqref{forward}

If $L \in \mathrm{SL}_n$ then there exists a grammar $\mathcal{G} =\langle V, T \rangle$ where $T \subseteq \mathbf{F}_n({\rtimes} \cdot V^* \cdot {\ltimes})$.

First, note for any string $s \in L$, by definition $\mathbf{F}_n(s) \subseteq T$. Second, note that for all substrings $s^\prime$ of $s \in L$ such that $|s^\prime| \geq n$, $\mathbf{F}_n (s^\prime) \subseteq T$.

Consider the statement of $n$-SSC. Choose $s_1=u_1xv_1 \in L$ and $s_2=u_2xv_2 \in L$ with \(|x|= n-1\). We must show that $u_1 x v_2 \in L$.

Since $u_1x$ is a substring of $u_1xv_1 \in L$, it follows that $\mathbf{F}_n(\rtimes u_1x) \subseteq T$. Likewise, since $xv_2$ is a substring of $u_2xv_2 \in L$, it follows that $\mathbf{F}_n(xv_2 \ltimes) \subseteq T$.

Now, since the set of factors in the string $u_1 x v_2$, $\mathbf{F}_n(\rtimes u_1 x v_2 \ltimes)$ is the union of the two sets of factors $\mathbf{F}_n(\rtimes u_1 x) \cup \mathbf{F}_n(x v_2 \ltimes)$ and these are both subsets of $T$, $\mathbf{F}_n(\rtimes u_1 x v_2 \ltimes)$ is also a subset of $T$.

So $u_1 x v_2\in L$, and thus $n$-SSC holds of $L$.

Proof of \eqref{backward}

To show the opposite direction, we need to show that if a set of strings $L$ satisfies $n$-SSC, then there exists a grammar $\mathcal{G} =\langle V, T \rangle$ that generates exactly that set of strings, that is, $L({\mathcal{G}}) = L$.

Let $V$ be $\mathbf{F}_1(L)$ and $T$ be $\mathbf{F}_n(L)$. That is, let $\mathcal{G} = \langle \mathbf{F}_1(L), \mathbf{F}_n(L) \rangle$. We now show that $L(\mathcal{G})=L$.

\[\newcommand{\ms}[1]{ { \color{blue}{#1}} }\]

Now we have proven that implication holds in both directions:

\[n\!\operatorname{-SSC}(L) \iff L \in \mathrm{SL}_n\]

So, $n$-SSC is an abstract categorization of the $n$-strictly local languages.


Using $n$-SSC

Typically, we use $n$-SSC to show that a language is not strictly $n$-local. To do this, we usually use $n$-SSC in its negative form

\[\neg n\!\operatorname{-SSC}(L) \implies \neg (L \in \mathrm{SL}_n)\]

Recall that \(n\!\operatorname{-SSC}(L)\) means that for any two strings in \(L \in \mathrm{SL}_n\), and any pivot of length $n-1$ we can swap the post-pivot substrings and the resulting string will also be in $L$. Thus, to show that a set of strings is not strictly $n$-local, it suffices to find one pair of strings and one pivot of length $n-1$ that does not allow such a swap. Let’s look at an example.

Take the language $L= {a b^* a}$, that is, the set of all strings with $a$’s at the ends. We claim that this language is not strictly $2$-local. According to 2-SSC, we should be able to pick any two strings $s_1$ and $s_2$ in $L$, and it must be the case that for all length $1$ pivots $x$ that are shared between $s_1$ and $s_2$, if $s_1 = u_1 x v_1$ and $s_2 = u_2 x v_2$ then $u_1 x v_2$ is also a member of $L$.

Our new string is now $\rtimes a b b a b b b a \ltimes \notin L$. Thus, $L \notin \mathrm{SL}_2$

Suffix Substitution Closure for the class $\mathrm{SL}$

We can define a generalization of suffix substitution closure for the entire class $\mathrm{SL} = \bigcup\limits_{n \in \mathbb{N}} \mathrm{SL}_{n}$.

If $L \in \mathrm{SL}$ then there exists a $n$ such that for all $u_1, u_2, v_1, v_2 \in V^*$ and $x \in V^{n-1}$ if $u_1 x v_1 \in L$ and $u_2 x v_2 \in L$ then $u_1 x v_2 \in L$. In other words, a language is in $\mathrm{SL}$ if there exists an $n$ such that $n$-SSC holds for that language.

Is Natural Language Strictly Local?

Is natural language strictly local? We claim that there is no $n$ for which all natural languages are strictly $n$-local. We demonstrate this by showing a construction in the world’s languages which cannot be captured by any strictly local grammar following the same logic that we used to demonstrate that natural language could not be contained in the finite languages. See chapter Classes of Languages.

Many languages contain a polar question particle which makes a sentnece into a yes-no question. Examples include the Latin particle -ne, the Mandarin particle ma and the particle -tu which appears in some dialects of Quebec French. These particles vary in where they can appear. The Latin particle always appears on the second word of the sentence, the Mandarin particle at the end of the sentence and the Quebec French particle on the main verb of the sentence. Nevertheless, all of these systems share an important property: The question particle can appear only once in a sentence.

Consider the following homomorphism where q refers to the question particle.

\[\begin{eqnarray} h(\epsilon) & = & \epsilon\\ h(q) & = &b\\ h(x) & = & a\ \forall\ x \notin \{\epsilon,q\} \end{eqnarray}\]

Sentences which do not contain contain the question particle get mapped into \(\{a^*\}\) while sentences which do contain the question particle get mapped into \(\{a^*ba^*\}\). Thus, the language of this homomorphism is the language with at most one $b$. Let’s call this language $L_{\leq 1b}$.

One can show that the strictly local languages are closed under language homomorphism, meaning that if you apply such a homomorphism you cannot produce a language that isn’t strictly-local if you start with one that is.

We claim that $L_{\leq 1b}$ is not strictly local. In order to prove that, we need to show that there does not exist a $n$ where $n$-SSC holds for this language.

We imagine that we are playing a game with an adversary. The adversary first picks some length $n$. Then it is our turn, and we are allowed to choose $u_1xv_1$ and $u_2xv_2$, such that $|x|=n-1$ and such that $n$-SSC does not hold, i.e., $u_1xv_2 \notin L$. Remember that since $n$-SSC requires that $u_1xv_2$ be in $L$ for any two strings $u_1xv_1$ and $u_2xv_2$ and any $x$ with length $n-1$. Thus, merely exhibiting example where this isn’t true for any $n$ chosen by the adversary will get us our desired result.

We finish our proof in the following way. If the adversary chooses $n$ choose the following two strings $s_1 = \rtimes a b a^{n-1} \ltimes$ and $s_2 = \rtimes a^{n-1} b a \ltimes$ clearly these are both in $L_{\leq 1b}$. Now let $x = a^{n-1}$. By $n$-SSC, $\rtimes a b a^{n-1} b a \ltimes \in L$ is also in the language. But this is in fact not in $L_{\leq 1b}$, thus $L_{\leq 1b}$ cannot be strictly local for any $n$.

Thus, at least some natural languages cannot be captured by any $n$-gram model.

This also illustrates an important intuitive property of the strictly local languages. These languages cannot count the number of times some symbol appears in a string if the symbol can appear at arbitrary positions in the string. In the next units, we will begin to see some formalisms that can count.


21 N-gram Models 23 Latent Structure