How to parse invalid HTML with Perl?
Posted by bodacydo on Stack Overflow
Published on 2012-07-04T21:12:41Z
I maintain a database of articles with HTML formatting. Unfortunately, the editors who wrote the articles didn't know proper HTML, so they often wrote things like:
<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>
I tried using HTML::TreeBuilder to parse this HTML, but after parsing it and dumping the resulting tree, all the elements between <div class="highlight">...</div> are gone. I'm left with just <div class="highlight"></div>.
The editors have also often done things like:
<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>
Parsing this with HTML::TreeBuilder again results in an empty <div class="article"></div>.
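For reference, here is a minimal sketch of how the two cases above can be reproduced. It assumes HTML::TreeBuilder is installed from CPAN. The helper name `dump_parsed` is hypothetical and only used for illustration, and the exact output may vary with the module version, so none is shown here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse an HTML fragment with HTML::TreeBuilder
# and return the resulting tree serialized back to HTML.
sub dump_parsed {
    my ($fragment) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($fragment);    # parse the whole string and finish

    # undef = default entity encoding, '  ' = indent string,
    # {} = no optional end tags, so every end tag is written out
    my $result = $tree->as_HTML(undef, '  ', {});

    $tree->delete;                      # release the tree explicitly
    return $result;
}

# The two problematic inputs quoted in the question.
my @samples = (
    '<div class="highlight"><html><head></head><body><p>Note that ...</p></html></div>',
    '<div class="article"><style>@font-face { font-family: "Cambria"; }</style>Article starts here</div>',
);

for my $sample (@samples) {
    print "Input:\n$sample\n\nParsed tree:\n", dump_parsed($sample), "\n\n";
}
```

Dumping the whole tree rather than only the div shows where the parser moved the nested elements, if it kept them at all.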
Any ideas on how to approach this broken HTML and actually make sense of it?
© Stack Overflow or respective owner