How to parse invalid HTML with Perl?
- by bodacydo
I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn't know proper HTML, so they often wrote things like:
<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>
I tried using HTML::TreeBuilder to parse this HTML, but after parsing it and dumping the resulting tree, all the elements inside <div class="highlight">...</div> are gone. I'm left with just <div class="highlight"></div>.
The editors have also often done things like:
<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>
Parsing this with HTML::TreeBuilder results in an empty <div class="article"></div> again.
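For reference, the two attempts above can be sketched roughly as follows. This is only a minimal illustration, assuming the tree is built with HTML::TreeBuilder's new/parse/eof and dumped with as_HTML; the helper name dump_fragment is hypothetical, and the exact output may differ depending on the module version and parser options.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse an HTML fragment and return the
# re-serialized tree as a string.
sub dump_fragment {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;                # signal end of input to the parser
    my $out = $tree->as_HTML;  # serialize the whole tree back to HTML
    $tree->delete;             # release the tree explicitly
    return $out;
}

# The two broken samples from the post, copied verbatim.
my @samples = (
    '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>',
    '<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>',
);

print dump_fragment($_), "\n" for @samples;
```

Printing the serialized tree for each sample makes it easy to see which elements survived parsing and which were dropped or moved.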
Any ideas on how to approach this broken HTML and actually make sense of it?