If you're not supposed to use Regular Expressions to parse HTML, then how are HTML parsers written?
Posted by Andy E on Stack Overflow
Published on 2010-03-08T10:30:52Z
I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).
This is rather confusing for me. I always thought that, in general, the best way to parse any complicated string is to use a regular expression. So how does an HTML parser work? Doesn't it use regular expressions to parse?
One particular argument for using a regular expression is that there's not always a parsing alternative (such as in JavaScript, where DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert an HTML string to DOM nodes.
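To make the question concrete, here is a minimal sketch of the kind of regex extraction I have in mind. The helper name `extractFirstDiv` and the sample strings are made up for illustration and are not taken from jQuery or any other library. It behaves on flat markup but returns a truncated result as soon as the same tag is nested, which I assume is the sort of problem the warnings are about.

```javascript
// Hypothetical helper: return the inner HTML of the first <div> using a regex.
function extractFirstDiv(html) {
  // Lazily match everything up to the FIRST closing </div>.
  // The pattern has no notion of nesting depth.
  const match = /<div[^>]*>([\s\S]*?)<\/div>/i.exec(html);
  return match ? match[1] : null;
}

// Made-up sample inputs.
const flat = '<div class="a">hello</div>';
const nested = '<div><div>inner</div> tail</div>';

console.log(extractFirstDiv(flat));   // hello
console.log(extractFirstDiv(nested)); // <div>inner   (stops at the inner </div>)
```

Switching to a greedy match only moves the problem: it would then run to the last `</div>` in the string, which is wrong whenever two sibling `<div>` elements follow each other.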
I'm not sure whether to CW this. It's a genuine question that I want answered, and it's not really intended to be a discussion thread.
© Stack Overflow or respective owner