HtmlAgilityPack giving problems with malformed html

Posted by Kapil on Stack Overflow
Published on 2010-05-31T14:18:26Z Indexed on 2010/05/31 14:23 UTC

I want to extract the meaningful text from an HTML document, and I am using the Html Agility Pack to do it. Here is my code:

    string convertedContent = HttpUtility.HtmlDecode(ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString)));

ConvertHtml:

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

ConvertTo:

    public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlAgilityPack.HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlAgilityPack.HtmlNodeType.Document:
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
                break;

            case HtmlAgilityPack.HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html) + " ");
                }
                break;

            case HtmlAgilityPack.HtmlNodeType.Element:
                switch (node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    foreach (HtmlNode subnode in node.ChildNodes)
                    {
                        ConvertTo(subnode, outText);
                    }
                }
                break;
        }
    }

In some cases the HTML pages are malformed. For example, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a bad meta tag, <meta content="text/html; charset=uft-8" http-equiv="Content-Type"> (note "uft" instead of "utf"). On such pages my code fails as soon as I try to load the HTML document.
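To make the failure concrete, here is a minimal sketch of one thing that might be worth trying. It rests on an assumption that has not been verified against that page: that the load fails because the parser tries to resolve the declared charset name, and "uft-8" is not a name .NET recognizes. The class and method names (LenientLoader, IsKnownEncoding, LoadIgnoringDeclaredCharset) are hypothetical. OptionReadEncoding is an existing HtmlDocument option; whether turning it off is enough for every malformed page is not something this sketch proves.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class LenientLoader
{
    // Hypothetical helper: returns true if .NET can resolve the charset name.
    // Encoding.GetEncoding throws ArgumentException for unknown names.
    public static bool IsKnownEncoding(string name)
    {
        try
        {
            Encoding.GetEncoding(name);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    // Hypothetical helper: parse the markup without honoring the charset
    // declared in the meta tag. The input is already a .NET string, so the
    // declared charset is not needed to decode it.
    public static HtmlDocument LoadIgnoringDeclaredCharset(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        // Assumption: skipping the declared-encoding lookup is what
        // sidesteps the failure on pages with a misspelled charset.
        doc.OptionReadEncoding = false;
        doc.LoadHtml(html);
        return doc;
    }
}
```

With this in place, ConvertHtml could call LenientLoader.LoadIgnoringDeclaredCharset(html) instead of constructing the HtmlDocument itself, and IsKnownEncoding("uft-8") can be used to confirm that the misspelled name is indeed rejected by the runtime.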

Can someone suggest how I can handle these malformed pages and still extract the relevant text from the HTML document?

Thanks, Kapil

© Stack Overflow or respective owner
