Better Text Searching with PEG

In our last entry, I bemoaned the shortcomings of regular expressions for complex tasks. (This was after spending a day wrestling with a three-page-long RegEx pattern for finding functions in a C# TextWrangler language module.) I sketched out what I thought an ideal string-matching system would look like.

Well, that was three weeks ago. I've had time to do some more serious research, and it turns out that there is some modern work that is very relevant. It almost fits exactly what we were looking for — but not quite. It's a new construct called Parsing Expression Grammar.

Parsing Expression Grammars (PEGs) were introduced in 2004 by Bryan Ford of MIT. His aim was to provide a new way of describing computer langugaes, as an alternative to context-free grammars (CFMs) and regular expressions (REs). Both CFMs and REs are "generative," meaning that they provide a way to generate any valid statement or pattern in a given language. However, in practice, they're never used that way; they're used in the other direction, to recognize how a given statement or pattern might fit the language. But it turns out that they're not very good at this, because such grammars are ambiguous. PEGs, on the other hand, are based on recognition, and are unambigous. They also have a lot of other neat properties, such as guaranteed linear-time parsing (i.e. they're fast), and they are expressive enough to do the job of both REs and CFMs with a single syntax.

So what does a PEG look like? It's very much like what we were looking for in our last blog entry. Whitespace is not significant, which allows you to space your pattern out for readability, and break it into lines for logical grouping. Let's dive right into an example:

Begin <- '(*'
End   <- '*)'
Comment <- Begin N* End
Guts  <- Comment / (!Begin !End .)
start <- Comment

This pattern matches Pascal-style comments, which (unlike block comments in C) can be nested. This is a classically difficult pattern to match in RegEx, which is not recursive. In PEG, it's fairly easy. Let's break it down step by step:

The first two lines define subpatterns called "Begin" and "End" that each match a string literal.

The "Comment" subpattern is defined as a Begin, zero or more "guts" (to be defined later), and an End.

The "Guts" pattern defines what can go inside of a comment block. This is either a comment, or any character (represented by the dot) that is not a Begin or End string (represented by the "!Begin" and "!End" predicates). It's important to note that when alternatives are given in PEG, as on this line, then those alternatives are tried in order. So as the parser is scanning the guts of a comment, if it comes across something that is itself a full comment, then it matches that first; only if that fails will it look for a character that couldn't be the start or end of a comment.

Finally, the "start" pattern is just a special token that defines what the pattern as a whole must match — in this case, a comment.

As you can see, it's all pretty readable and clear. After studying PEG for a few weeks, I'm convinced that it really is the future of defining programming languages. No more RegEx-based lexing and CFM-based parsing; a simple PEG grammar that defines everything is both easier and more powerful.

However, I'm not convinced that PEG, by itself, is a perfect substitute for RegEx in general string searching and substitution. There are a few things that make PEG suboptimal for this purpose, but the big one is this:

PEG does no real backtracking. This makes many patterns much harder to write than they should be, at least for a novice user. Because it does no backtracking, the repeat operators are always greedy (since if it were not, it would always match the minimum number of repeats).

As discussed last time, greedy matching is a common source of mistakes in RegEx, and it's far worse in PEG, since it lacks backtracking. For example, to match bold text in an HTML document, you might try this:

	'' .+ ''

But this won't work, because the ".+" will execute fully before PEG even looks at the next part of the sequence (the end tag). In other words, ".+" or ".*" in PEG will always match all the way to the end of the file. To write this pattern properly in PEG, you have to use a not-predicate:

	'' (!'' .)+ ''

...which is considerably harder to read (and write). I think for language authors, learning this noise syntax is no worse than learning to put curly braces and semicolons in the right places in C. But for casual users of string searching — i.e., someone using the Find/Replace dialog in their favorite text editor or word processor — this is asking too much.

Parsing Expression Grammars are an important step forward, and despite decades of history with REs and CFGs, I think we're going to see PEGs become the standard tool for language design. I also believe that we'll see something PEG-derived become standard for string searching. We've started work on that derivation already... tune in next week to see how it's going!