Luminary Apps : Blog

Last week, we gave a sneak peek of BOSS, a new approach to string searching. We mentioned a "bit of magic" with regard to the repetition modifiers, * (0 or more) and + (one or more): these would do a lazy match, except at the end of the search pattern, in which case they would be greedy.

We expect this to be the most controversial feature of the whole BOSS design, so it's worth some time to explain all the considerations behind it, why we made this decision, and what the heck "lazy" and "greedy" mean when it comes to string searching anyway.

These repetition modifiers specify how many times part of a pattern may be matched. In BOSS, as in most modern RegEx implementations, you can be explicit: x{3} means to match x (whatever that might be) three times, no more, no less. If x doesn't occur at least three times, then this part of the pattern doesn't match at all. If x occurs more than three times, then this part of the pattern matches only the first three, and whatever comes next in the pattern (if anything) begins at the start of the fourth.

BOSS also lets you specify a range of times a pattern might match. But there are two ways to specify any range, depending on whether you prefer more matches or fewer. x{3,5} means to match only three times if possible, but extend this to four or five repetitions, if necessary to enable the rest of the pattern to match. Conversely, x{5,3} means to match five times if possible, but reduce this to four or three repetitions if needed to let the rest of the pattern match.

In software terms, x{3,5} is "lazy" because it prefers to do as little work as possible; x{5,3} is "greedy" because it will grab as much of the input as it can.

But now we come to the shortcut syntax, + and *. Consider * (star). This is supposed to match zero or more times, but does that mean {0,INF} or {INF,0}? Because these are likely to be common shortcuts, we want them to reflect the user's probably intent in as many cases as possible. So we have to examine some use cases.

Case 1. Find text between two enclosing tags.

This case arises, for example, when trying to scrape data out of HTML. Suppose you know the data you want starts with a '<div class="data">' tag, and of course ends with an "</div>" tag, with no other div tags in between. In this case, the natural way to write this would be:

	'<div class="data">' .* '</div>'

In this case, .* should be lazy, i.e. {0,INF}. If it were greedy, then instead of matching up to the next closing tag, it would match to the last closing tag in the document, which is almost certainly not what you want.

Case 2. Find a quoted string.

This one is similar to the above, but because the start and end delimiters are single characters, there is a better solution:

	'"' [^"]* '"'

With this pattern, the content characters are defined to be anything that is NOT a quotation mark. So the result doesn't depend on whether the repetition is greedy or lazy. But, not all users are so sophisticated; someone is certain to try this instead:

	'"' .* '"'

Now the content characters are defined to be anything at all. This will still work, if * is lazy; but if it's greedy, then you'll get an incorrect result spanning all the quoted strings in the document.

Case 3. Find C-style identifiers.

Many programming languages define identifiers (variable names, etc.) as a letter or underscore, followed by zero or more letters, underscores, and digits. You would write this in BOSS as follows:

	[a-zA-Z_] [a-zA-Z0-9_]*

In this case, there is nothing after the repetition operator, so it matters very much what style of matching we do. The lazy operatotr would be done before it even began, because matching zero repetitions would be perfectly valid, no matter what was in the input. On the other hand, the greedy one would continue to match letters, digits, and underscores as far as it can — which is exactly what we want.

From these examples, it looks like lazy matching is the correct thing to do when there is something more to the pattern, since the search engine will extend the match as much as necessary to make the rest of it match. I tried hard to think of even one example where a greedy search would be preferable in such a case, but couldn't come up with any. (And while we used * for our examples, similar considerations apply to + in the same way.)

On the other hand, when there isn't something more after the repetition, then it a lazy search is never the right thing — since that would always match the minimum (zero or one repeats for * and +, respectively). Always matching zero repeats is pointless, and if the user wanted to always match one, she wouldn't use the repetition operator at all.

Let's pause a moment and see how RegEx and PEG handle this. PEG doesn't have backtracking at all; so it has no way to start with a minimal match, then extend this as needed to match the rest of the pattern. As a result, its repeat operators are (by necessity) always greedy. But this means that any time you want to match a string of stuff delimited by other stuff, you have to code a forward not-assertion, something like this:

	'<div class="data">' (~'</div>'.)* '</div>'

...which is good enough once you get used to it, I suppose, but a real pain in the arse when you're just trying to bang out a quick search.

RegEx, on the other hand, does have backtracking. But I suspect that its designers considered something like Case 3 above, saw that repetition needed to be greedy in such a case, and decided to make it so across the board. Then Cases 1 and 2 became apparent, where the obvious (but naive) RegEx pattern falls down (or worse, appears to work in a simple test but fails on real data). So they added modifiers to allow for lazy matching. So instead of *, you can write *? to mean "zero or more times, lazy mode":

	<div class="data">.*?</div>

But this is unintuitive; virtually every RegEx user who knows this trick, learned it after getting bitten by the default behavior.

So in BOSS, we'd like to avoid biting people in the first place. So we define the behavior of * and + to match our use cases: greedy at the end of the pattern, and lazy when followed by anything else. (If you ever need the other behavior, you can just use the explicity repeat counts, e.g. {INF, 0}, which is a lot clearer than adding another special modifier character anyway.)

There is a cost to this decision, of course: the rule is a bit more complex to explain, and require a little more effort for those trying to form a rigorous mental model of the behavior. But the advantage is, for people who are not language lawyers and are instead trying to simply get a task done, it Just Works. We think the cost is worth this benefit.

But we also want to know what you think! Weigh in with your comments below.

Luminary Apps

Recent Blog Posts

Better to be lazy or greedy?