RegEx is handy. I use it all the time. For simple tasks, it's quite pleasant to use. For intermediate-sized tasks, it's acceptable. But for complex tasks, it is a nightmare to write, read, and maintain.
So, I'd like to suggest that it's time to design an alternative -- something that works just as well on complex tasks as it does for simple ones, and stays readable and maintainable. I agree with not reinventing the wheel... except when our current wheel is square and lumpy.
I'm not the first to think this, of course. It's been discussed at some length on c2.com and Stack Overflow (more than once). But as far as I can tell, nothing has yet come out of these discussions. Here's what I would love to see in a RegEx alternative:
1. An open-source library written in clean, portable C, so it is easily incorporated into other apps and languages.
2. A clean, readable syntax with few special characters.
3. At least as expressive as RegEx -- so if you can do it in RegEx, you ought to be able to do it in the new system too.
4. Especially good (i.e. easy to read and write) at complex patterns, but still good for simple ones.
5. Partial matching, with reporting of what can come next. This is handy when using it for validation: the user's input so far may not match the whole pattern, but it may be well on its way, and the system could tell the application programmer what possible characters can come next (to auto-fill or suggest for the user).
6. Conditional matching. For example, see this Stack Overflow question where the guy needed to limit the numeric part of an input to between 100 and 1100; that's really hard to do traditionally.
7. Can be easily extended to do lex/yacc type jobs; that is, you can get a callback or some such on each matched subpattern, so you can take some action (build a parse tree, do syntax coloring, or whatever).
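To make item 5 a bit more concrete, here is a rough Python sketch of "partial matching with reporting of what can come next". Everything in it is an assumption made for illustration: the pattern is just a fixed-length list of allowed-character sets, and the names `TIME_PATTERN` and `next_chars` are hypothetical. A real engine would need to do the same thing for repetition and alternation.

```python
import string

DIGITS = set(string.digits)

# Hypothetical pattern: a fixed-length sequence of allowed-character sets,
# here describing a time such as "09:45".
TIME_PATTERN = [DIGITS, DIGITS, {":"}, DIGITS, DIGITS]


def next_chars(pattern, text):
    """Return the set of characters that may follow `text`.

    Returns None if `text` is not a valid prefix of the pattern,
    and an empty set if `text` already matches the whole pattern.
    """
    if len(text) > len(pattern):
        return None
    # Every character typed so far must be allowed at its position.
    for allowed, ch in zip(pattern, text):
        if ch not in allowed:
            return None
    if len(text) == len(pattern):
        return set()  # complete match, nothing more is expected
    return set(pattern[len(text)])
```

With this, `next_chars(TIME_PATTERN, "09")` reports that only a colon can come next, which is exactly what a validating input field would want in order to auto-fill it.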
This sounds like a tall order, but I think we can do it. The basic approach is to break a complex search pattern up into smaller parts, recursively, until we get down to very basic matching (exact characters, for example). That's clearly going to require some syntax.
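As a rough illustration of that recursive breakdown, here is a minimal Python sketch in which every pattern is a small function built out of smaller ones. This is not a proposed implementation; the names `lit`, `seq`, `either`, `repeat`, and `full_match` are made up for the example. Each matcher yields every position at which it could end, so a later part that fails simply causes the next alternative to be tried (backtracking).

```python
def lit(s):
    """Match the exact string `s`."""
    def match(text, pos):
        if text.startswith(s, pos):
            yield pos + len(s)
    return match


def seq(*parts):
    """Match each part in order."""
    def match(text, pos):
        if not parts:
            yield pos
            return
        first, rest = parts[0], seq(*parts[1:])
        for mid in first(text, pos):
            yield from rest(text, mid)
    return match


def either(*alts):
    """Match any one of the alternatives, tried left to right."""
    def match(text, pos):
        for alt in alts:
            yield from alt(text, pos)
    return match


def repeat(part, lo, hi=None):
    """Match `part` between `lo` and `hi` times (hi=None means no limit)."""
    def match(text, pos, count=0):
        # Greedy: try one more repetition before settling for fewer.
        if hi is None or count < hi:
            for mid in part(text, pos):
                if mid > pos:  # skip zero-width matches to avoid looping
                    yield from match(text, mid, count + 1)
        if count >= lo:
            yield pos
    return match


def full_match(pattern, text):
    """True if the pattern can consume the whole of `text`."""
    return any(end == len(text) for end in pattern(text, 0))


# One or more of "ab" or "cd", followed by "!".
demo = seq(repeat(either(lit("ab"), lit("cd")), 1), lit("!"))
```

Here `full_match(demo, "abcdab!")` succeeds and `full_match(demo, "abc!")` does not. The point is only that the complex pattern is nothing more than small matchers composed recursively, down to exact strings.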
Standard regex patterns try to avoid syntax, leaving most characters (including whitespace) to represent themselves directly. But syntax is unavoidable, so they define a (rather large) set of characters to be "special" and serve syntactical functions, as in '(.+?)'. This forces the use of some escaping mechanism for when you want to use those characters as themselves, and makes their meaning as syntax hard to remember. So in the new system, let's begin by deciding that characters that represent themselves must be quoted as a string literal; that frees us to use anything else (including whitespace) to make our expression readable and natural.
Having decided that, there are only a few zillion details to hammer out. So let's dive in with some examples. First, suppose we want to match a C-style identifier, that is, a letter or underscore, followed by a string of letters, digits, and underscores. We might write that this way:
    startChar = Letter or `_`
    afterChar = startChar or Digit
    startChar afterChar{0-}
What's all this, then? The first two lines here define subpatterns, called "startChar" and "afterChar"; the last line defines what the input needs to actually match (by invoking the previously-defined subpatterns). The curly braces, similar to RegEx, define how many times a pattern can repeat; {2-4} would mean from 2 to 4 times, and {0-} means zero or more.
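To pin down the repetition suffix a little, here is a small Python sketch that reads a suffix such as {2-4} or {0-} and returns the lower and upper bounds. The function name `parse_repeat` is hypothetical, and treating a bare {3} as "exactly three" is my assumption, based on how Digit{3} is used in the later examples.

```python
import re

# "{3}", "{2-4}" or "{0-}": a lower bound, then an optional "-" and upper bound.
_REPEAT = re.compile(r"\{([0-9]+)(?:(-)([0-9]*))?\}")


def parse_repeat(spec):
    """Return (lo, hi) for a repetition suffix; hi is None when unbounded."""
    m = _REPEAT.fullmatch(spec)
    if m is None:
        raise ValueError(f"not a repetition suffix: {spec!r}")
    lo = int(m.group(1))
    if m.group(2) is None:
        return lo, lo  # "{3}" means exactly three times
    hi = int(m.group(3)) if m.group(3) else None  # "{0-}" has no upper bound
    if hi is not None and hi < lo:
        raise ValueError(f"upper bound below lower bound: {spec!r}")
    return lo, hi
```

So `parse_repeat("{2-4}")` gives `(2, 4)` and `parse_repeat("{0-}")` gives `(0, None)`.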
"Letter" and "Digit" are built-in character classes that match what you'd expect them to match. "or" is an operator that means to match either of the alternatives; we could perhaps use "|" or "||" for this, but I tend to favor the English term. Finally, `_` is a string literal; this matches the underscore character.
This is a simple example, and not something that would be too hard to write in standard RegEx. Here's a better one: it matches a phone number, with optional area code:
    areaCode = Optional(`(`) Digit{3} Optional(`)`)
    sep = Optional(`-` or `.` or ` `)
    Optional(areaCode) sep Digit{3} sep Digit{4}
Looking at the last line here, you can see that a phone number is an optional area code, a separator (which is always optional), three digits, a separator, and four more digits. "Optional" here is a function equivalent to putting {0-1} after its argument; it's just a slightly more readable way of saying that something is not required. All very clear and readable, isn't it? (For reference, an equivalent RegEx pattern would be "(?:\(?([0-9]{3})\)?)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})".)
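If you want to try the conventional version, here is a short Python check using the standard `re` module. The regex is my own transcription of the phone-number description above (optional area code with optional parentheses, optional separators), split across lines with comments; the name `parse_phone` is just for this example.

```python
import re

PHONE = re.compile(
    r"(?:\(?([0-9]{3})\)?)?"  # optional area code, parentheses optional
    r"[-. ]?"                 # optional separator
    r"([0-9]{3})"             # three digits
    r"[-. ]?"                 # optional separator
    r"([0-9]{4})"             # four digits
)


def parse_phone(text):
    """Return (area_code, prefix, line) or None if `text` is not a phone number.

    area_code is None when the input has no area code.
    """
    m = PHONE.fullmatch(text)
    return m.groups() if m else None
```

Note that, like the readable version, this accepts unbalanced input such as "(555 123-4567", since each parenthesis is optional on its own.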
Next, let's look at constraints. Suppose we want to match a part number, defined as either an "A" followed by a number from 150 to 250, or a "B" followed by a number from 200 to 325.
    partA = (`A` or `a`) (Digit{3} where Value >= 150 and Value <= 250)
    partB = (`B` or `b`) (Digit{3} where Value >= 200 and Value <= 325)
    partA or partB
This adds a "where" clause to the second expression in each part. Where clauses let you put various constraints on the match; Value would be a common one, which converts the matched text to a number so that you can do a numeric comparison. So the three digits found on the first line would be allowed to match only if their value is between 150 and 250 (inclusive).
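For comparison, with today's tools a numeric constraint like this usually ends up outside the pattern, in ordinary code. Here is a Python sketch of the part-number check done that way; `PART`, `RANGES`, and `is_part_number` are names I made up for the example.

```python
import re

# The pattern only captures the shape: a letter A or B, then three digits.
PART = re.compile(r"([AaBb])([0-9]{3})")

# The "where" constraints live in a separate table, keyed by the letter.
RANGES = {"A": (150, 250), "B": (200, 325)}


def is_part_number(text):
    """True if `text` is A150..A250 or B200..B325 (letter case ignored)."""
    m = PART.fullmatch(text)
    if m is None:
        return False
    lo, hi = RANGES[m.group(1).upper()]
    return lo <= int(m.group(2)) <= hi
```

This works, but the pattern and its constraints are now in two places, which is the split a where clause would remove.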
Finally, it should be pointed out that you wouldn't have to break your expression up into multiple lines, as in the examples above. You can always do it as a one-liner too, as in this example, which matches a C-style character literal:
`'` Anything{1-} `'`
This is all sheer speculation at this point; we haven't attempted to implement it yet. Some of our goals might be challenging (e.g., making where clauses work with partial matching). But I'm pretty sure most of it is doable. If done well, the result will be a pattern-matching library that lends itself to much clearer, more maintainable code than is typical today, and which can do things standard RegEx can't easily do.
What do you think? Genius or insanity? Weigh in with your thoughts below.