Using lookahead assertions in regular expressions
- by Greg Jackson
I use regular expressions on a daily basis, as my daily work is 90% in Perl (legacy codebase, but that's a different issue). Despite this, I still find lookahead and lookbehind to be terribly confusing and often unreadable. Right now, if I were to get a code review with a lookahead or lookbehind, I would immediately send it back to see if the problem can be solved by using multiple regular expressions or a different approach. The following are the main reasons I tend not to like them:
They can be terribly unreadable. Lookahead assertions, for example, start from the beginning of the string no matter where they are placed. That, among other things, can cause some very "interesting" and non-obvious behaviors.
It used to be the case that many languages didn't support lookahead/lookbehind (or supported them as "experimental features"). This isn't the case quite as much, but there's still always the question as to how well it's supported.
Quite frankly, they feel like a dirty hack. Regexps often already are, but they can also be quite elegant, and have gained widespread acceptance.
I've gotten by without any need for them at all... sometimes I think that they're extraneous.
Now, I'll freely admit that especially the last two reasons aren't really good ones, but I felt that I should enumerate what goes through my mind when I see one. I'm more than willing to change my mind about them, but I feel that they violate some of my core tenets of programming, including:
Code should be as readable as possible without sacrificing functionality -- this may include doing something in a less efficient, but clearer was as long as the difference is negligible or unimportant to the application as a whole.
Code should be maintainable -- if another programmer comes along to fix my code, non-obvious behavior can hide bugs or make functional code appear buggy (see readability)
"The right tool for the right job" -- I'm sure you can come up with contrived examples that could use lookahead, but I've never come across something that really needs them in my real-world development work. Is there anything that they're really the best tool for, as opposed to, say, multiple regexps (or, alternatively, are they the best tool for most cases they're used for today).
My question is this: Is it good practice to use lookahead/lookbehind in regular expressions, or are they simply a hack that have found their way into modern production code?
I'd be perfectly happy to be convinced that I'm wrong about this, and simple examples are useful for examples or illustration, but by themselves, won't be enough to convince me.