If you're not supposed to use Regular Expressions to parse HTML, then how are HTML parsers written?
- by Andy E
I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).
This is rather confusing for me, I always thought that in general, the best way to parse any complicated string is to use a regular expression. So how does a HTML parser work? Doesn't it use regular expressions to parse.
One particular argument for using a regular expression is that there's not always a parsing alternative (such as JavaScript, where DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert a HTML string to DOM nodes.
Not sure whether or not to CW this, it's a genuine question that I want to be answered and not really intended to be a discussion thread.