This is a course on the foundations of computational linguistics.


Most courses on computational linguistics or natural language processing tend to be fairly advanced survey courses, assuming a large amount of knowledge of computer science and covering a variety of somewhat unrelated modeling techniques, problem domains, or tasks. This course takes a different approach, seeking to build up a small set of conceptual tools and empirical methods from the ground up.

Some specific goals of the course include:


The modern study of language originated in the 1950s, when tools from the recently emerged theory of computation started to be applied to the problem of language. One important early result in this area was the Chomsky hierarchy, which gave a containment hierarchy of phrase-structure grammars, or grammars built out of rewrite rules like the following:

Chomsky showed that restricting the form of these rules in four ways (left-linear, context-free, context-sensitive, and unrestricted) gives rise to four classes of formal languages that are strictly nested: each class is properly contained in the next.
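
As a rough sketch of what these restrictions look like, the four rule shapes are commonly written as below. The notation is an assumption for illustration and is not taken from the course materials: $A$ and $B$ stand for single nonterminal symbols, $a$ for a single terminal symbol, and $\alpha$, $\beta$, $\gamma$ for arbitrary strings of terminals and nonterminals. In the context-sensitive case $\gamma$ is additionally assumed to be nonempty, and in the unrestricted case the left-hand side $\alpha$ is assumed to contain at least one nonterminal.

```latex
\begin{align*}
\text{left-linear:}       \quad & A \to B\,a \quad \text{or} \quad A \to a \\
\text{context-free:}      \quad & A \to \gamma \\
\text{context-sensitive:} \quad & \alpha A \beta \to \alpha \gamma \beta \\
\text{unrestricted:}      \quad & \alpha \to \beta
\end{align*}
```

Read from top to bottom, each shape loosens the previous one: a left-linear rule is a special case of a context-free rule, and a context-free rule is the special case of the context-sensitive shape in which the surrounding context $\alpha$, $\beta$ is empty (setting aside rules that rewrite a nonterminal to the empty string).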

This remains a foundational result in theoretical computer science, and locating natural language in the Chomsky hierarchy remains a goal of modern linguistics. The Chomsky hierarchy, up to the context-free languages, will be our way of organizing our study of linguistic models. Along the way, we will cover several refinements of this hierarchy and extensions using probability.

What this course is not

What you need: curiosity, ambition, fearlessness, and the ability to use Google to find answers to technical problems.

What is the difference between COMP 550 and COMP/LING 445?

The two courses cover overlapping but complementary material, with differences in assumed background knowledge and focus. While some topics will appear in both courses, the emphasis will be different enough that you can fruitfully take both courses (if you are an undergraduate student).

COMP 550 assumes more computational background, including deep familiarity with probability and algorithms. COMP 550 focuses on technological perspectives of natural language processing as a subdiscipline of artificial intelligence. It covers computational semantics, discourse, and applications such as automatic summarization and machine translation, which COMP/LING 445 does not.

COMP/LING 445 goes in depth into the fundamental and formal mathematical and linguistic principles that underlie modern computational linguistics, with an emphasis on applications in linguistic analysis. It draws connections to automata theory, and rigorously derives some of the models that form the basic analytical toolbox of computational linguistics, which COMP 550 does not.


We will use Clojure, a LISP based on the λ-calculus. Clojure has several advantages.


General LISP and Clojure learning resources: