BOSS: Sneak Peek

In the last couple of blog posts, we first reviewed some of the shortcomings of regular expressions (RegEx). We then took a look at parsing expression grammars (PEGs), which are a new formalism that has a lot of advantages for defining (and more importantly, parsing) computer languages. But while they're great for that, using them directly for string searching is a bit of a square-peg-round-hole situation.

So, we at Luminary Apps have begun work on a string matching library that combines the best features of PEG and RegEx. This blog post is the first public discussion of that library. It's called BOSS, and I think you're going to love it.

BOSS is designed to satisfy several goals:

  1. Readable, simple syntax with few special characters
  2. What special characters it does have, should feel familiar to RegEx users
  3. At least as expressive as RegEx
  4. Especially good at complex patterns, but still good for simple ones
  5. Partial matching (with reporting of what can come next)
  6. Easily extended to do lex/yacc type jobs

The big idea is to provide more powerful string searching and reporting, for those cases where RegEx becomes too onerous to use — but still be simple and familiar enough for everyday use.

With that in mind — and the understanding that all this is in development, and subject to change — let's dive into the syntax.

The most obvious difference between RegEx and BOSS syntax is that, in BOSS, string literals (i.e. characters which match themselves) must be quoted. This greatly reduces the number of "special" characters, and the amount of obtuse syntax needed. It also allows complex expressions to be easily broken up and even commented for clarity, as well as enabling named character classes. Literals can be quoted with either single- or double-quotes. So, a BOSS pattern to search for any occurrence of "the" could be written as




Borrowing from RegEx, BOSS supports the period (dot) character to match any character. So, if you wanted to match "th" plus any one additional character, you would write this as:

	"th" .

BOSS adds an optional repeat count, similar to some RegEx implementations. This can take one or two arguments; the first is the "goal" repeat count (how many repeats BOSS will prefer to match), and the second is the "acceptable" repeat count (how many will be allowed, through backtracking, in order to match the rest of the pattern). So, suppose we want to match a "th" followed by one or two e's, but we prefer two if possible:

	"th" "e"{2,1}

This does a greedy match, which is almost always what you want at the end of your pattern, since otherwise, it would always match just the minimum repeat count. But let's take a more realistic example from a previous post: finding the text between HTML "bold" tags:

	"<b>" .{0,INF} "</b>"

Note that INF is a keyword for use in the repeat count that means "infinite." So, this says to match "<b>", followed by zero or more repeats of any character, followed by "</b>". Because we wrote {0,INF} instead of {INF,0}, this is a lazy search; it will match each "<b>" to the very next "</b>", instead of to the last one in the document.

At this point, BOSS borrows a bit more syntax from RegEx: question mark to mean "zero or one", asterisk (star) to mean "zero or more", and plus to mean "one or more". So the above example could be written more simply as:

	"<b>" .* "</b>"

(Note that we're still working out a bit of magic with "*" and "+": we want these to be lazy in most cases, but if they are the last part of the pattern, then a lazy search makes little sense. So in that case, we want to do a greedy search.  Watch for more on this later.)

So far, we've shown patterns that are easily written as one-liners. But eventually you'll run into something more complex, that's better broken up into multiple lines. Let's take (again) the example of matching Pascal comments. In BOSS, this could be written like so:

	Comment = "(*" N* "*)"
	N = Comment | .

The first line defines a "Comment" as "(*", followed by zero or more middle bits we'll refer to as "N", followed by "*)". The next line defines the middle bits: either a complete comment, or any character. Note that BOSS uses vertical-bar to indicate "or", just like RegEx; but in BOSS, as in PEG, order matters. So if something inside a comment is itself a complete comment, that will match before the dot. The first two lines here define the recursion magic that allow nested comments to match properly. The last line, "Comment", is tho only line here that doesn't define a subpattern. So, that's what the pattern as a whole must match.

The features we've shown so far are enough to handle most common cases. But BOSS doesn't stop there. Like RegEx, BOSS supports character classes (e.g. [a-zA-Z_] to represent any English letter or underscore), and we'll include a good set of built-in Unicode-savvy character classes like WHITESPACE, DIGIT, and EOL — all descriptively named so you know what they do. Like PEG, it supports look-ahead predicates, that can require (or prohibit) something to be followed by something else. Going beyond either of these, BOSS will support WHERE clauses, that allow you to place additional conditions on a subpattern. Here's a quick example showing some of these features:

	Equipment = [A-F] (DIGIT{3} WHERE Value >= 150 AND Value <= 300)
	Labor = [LM] DIGIT{4}
	Equipment | Labor

This pattern matches part numbers for equipment or labor, where an equipment part number starts with a letter from A to F, followed by a three-digit number with a value from 150 to 300; and a labor part number is an L or M, followed by any 4-digit number.

Needless to say, we're pretty excited about all this.

We're prototyping the code for all this in REAL Studio, where we'll be able to use it for many of the real-world apps we write for ourselves and our clients. But once we're satisfied with the syntax and behavior, we'll re-implement it in clean, portable C code. That will be free, open-source software, with a liberal license, so it can be incorporated into all your favorite tools.

So, if you have any feedback on this new approach to string searching, now's the best time to weigh in! Please use the comment form below, or contact us directly, and let us know what you think.