If you're not supposed to use Regular Expressions to parse HTML, then how are HTML parsers written?

Posted by Andy E on Stack Overflow
Published on 2010-03-08T10:30:52Z Indexed on 2010/03/08 10:36 UTC

I see questions every day asking how to parse or extract something from an HTML string, and the first answer or comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).

This is rather confusing for me. I always thought that, in general, the best way to parse any complicated string is to use a regular expression. So how does an HTML parser work? Doesn't it use regular expressions to parse?

One particular argument for using a regular expression is that a parsing alternative isn't always available (in JavaScript, for example, DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert an HTML string to DOM nodes.
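
To make the warning concrete, here is a minimal sketch of the kind of input it is usually about: one element nested inside another of the same name. The sample string and the `outerDivContent` helper are made up for illustration, and the helper only understands bare `<div>` and `</div>` tags, so it is nowhere near a real HTML parser.

```javascript
// Hypothetical input: one <div> nested inside another.
const html = '<div>outer <div>inner</div> tail</div>';

// Regex attempt: lazy match from the first <div> to the nearest </div>.
// It stops at the inner closing tag, so the captured content is cut short.
const regexMatch = html.match(/<div>(.*?)<\/div>/)[1];

// Hand-rolled alternative (illustrative only): scan the tags one by one
// and track nesting depth. Handles bare <div> / </div> and nothing else
// (no attributes, comments, or other elements).
function outerDivContent(s) {
  const tag = /<(\/?)div>/g;
  let depth = 0;
  let start = -1;
  let m;
  while ((m = tag.exec(s)) !== null) {
    if (m[1] === '') {
      // Opening tag: remember where the outermost content begins.
      if (depth === 0) start = tag.lastIndex;
      depth += 1;
    } else {
      // Closing tag: the outermost element ends when depth returns to 0.
      depth -= 1;
      if (depth === 0) return s.slice(start, m.index);
    }
  }
  return null; // unbalanced input
}

const scanned = outerDivContent(html);
console.log(regexMatch); // outer <div>inner
console.log(scanned);    // outer <div>inner</div> tail
```

In this sketch the helper still uses a regex to find individual tags; what the single-regex attempt lacks is the depth counter that keeps track of nesting.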

I'm not sure whether to make this community wiki (CW). It's a genuine question that I want answered, and it isn't really intended to be a discussion thread.

© Stack Overflow or respective owner
