Can this be improved? Scrubbing of dangerous html tags.

Posted by chobo2 on Stack Overflow See other posts from Stack Overflow or by chobo2
Published on 2010-06-10T22:53:12Z Indexed on 2010/06/10 23:23 UTC
Read the original article Hit count: 331

Filed under:

c#

|

.NET

|

JavaScript

|

security

I been finding that for something that I consider pretty import there is very little information or libraries on how to deal with this problem.

I found this while searching. I really don't know all the million ways that a hacker could try to insert the dangerous tags.

I have a rich html editor so I need to keep non dangerous tags but strip out bad ones.

So is this script missing anything?

It uses html agility pack.

public string ScrubHTML(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //Remove potentially harmful elements
    HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.ParentNode.RemoveChild(node, false);

        }
    }

    //remove hrefs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {

        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("href", "#");
        }
    }


    //remove img with refs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("src", "#");
        }
    }

    //remove on<Event> handlers from all tags
    nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondoubleclick or @onload or @onunload]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("onFocus");
            node.Attributes.Remove("onBlur");
            node.Attributes.Remove("onClick");
            node.Attributes.Remove("onMouseOver");
            node.Attributes.Remove("onMouseOut");
            node.Attributes.Remove("onDoubleClick");
            node.Attributes.Remove("onLoad");
            node.Attributes.Remove("onUnload");
        }
    }

    // remove any style attributes that contain the word expression (IE evaluates this as script)
    nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("stYle");
        }
    }

    return doc.DocumentNode.WriteTo();
}

Edit

2 people have suggested whitelisting. I actually like the idea of whitelisting but never actually did it because no one can actually tell me how to do it in C# and I can't even really find tutorials for how to do it in c#(the last time I looked. I will check it out again).

How do you make a white list? Is it just a list collection?
How do you actual parse out all html tags, script tags and every other tag?
Once you have the tags how do you determine which ones are allowed? Compare them to you list collection? But what happens if the content is coming in and has like 100 tags and you have 50 allowed. You got to compare each of those 100 tag by 50 allowed tags. Thats quite a bit to go through and could be slow.
Once you found a invalid tag how do you remove it? I don't really want to reject a whole set of text if one tag was found to be invalid. I rather remove and insert the rest.
Should I be using html agility pack?

© Stack Overflow or respective owner

Related posts about c#

.NET WebRequest.PreAuthenticate not quite what it sounds like

as seen on West-Wind - Search for 'West-Wind'
I’ve run into the problem a few times now: How to pre-authenticate .NET WebRequest calls doing an HTTP call to the server – essentially send authentication credentials on the very first request instead of waiting for a server challenge first? At first glance this sound like it should be easy:… >>> More
HttpWebRequest and Ignoring SSL Certificate Errors

as seen on West-Wind - Search for 'West-Wind'
Man I can't believe this. I'm still mucking around with OFX servers and it drives me absolutely crazy how some these servers are just so unbelievably misconfigured. I've recently hit three different 3 major brokerages which fail HTTP validation with bad or corrupt certificates at least according to… >>> More
The dynamic Type in C# Simplifies COM Member Access from Visual FoxPro

as seen on West-Wind - Search for 'West-Wind'
I’ve written quite a bit about Visual FoxPro interoperating with .NET in the past both for ASP.NET interacting with Visual FoxPro COM objects as well as Visual FoxPro calling into .NET code via COM Interop. COM Interop with Visual FoxPro has a number of problems but one of them at least got a lot… >>> More
Dynamic Type to do away with Reflection

as seen on West-Wind - Search for 'West-Wind'
The dynamic type in C# 4.0 is a welcome addition to the language. One thing I’ve been doing a lot with it is to remove explicit Reflection code that’s often necessary when you ‘dynamically’ need to walk and object hierarchy. In the past I’ve had a number of ReflectionUtils that used string based expressions… >>> More
Finding a Relative Path in .NET

as seen on West-Wind - Search for 'West-Wind'
Here’s a nice and simple path utility that I’ve needed in a number of applications: I need to find a relative path based on a base path. So if I’m working in a folder called c:\temp\templates\ and I want to find a relative path for c:\temp\templates\subdir\test.txt I want to receive back subdir\test… >>> More

Related posts about .NET

Apt-Get Update: failure to fetch; can't connect to any sources

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I realize there are dozens of "apt-get update: failure to fetch" questions (I read through all I could find), but my present circumstance is unique to 12.04 and it affects all sources; not just launchpad. Additionally, I've tried several different servers in Europe and the U.S. as well as the "main… >>> More
12.04: Apt-Get Update: failure to fetch; can't connect to any sources

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I realize there are dozens of "apt-get update: failure to fetch" questions (I read through all I could find), but my present circumstance is unique to 12.04 and it affects all sources; not just launchpad. Additionally, I've tried several different servers in Europe and the U.S. as well as the "main… >>> More
What's New in ASP.NET 4

as seen on Geeks with Blogs - Search for 'Geeks with Blogs'
The .NET Framework version 4 includes enhancements for ASP.NET 4 in targeted areas. Visual Studio 2010 and Microsoft Visual Web Developer Express also include enhancements and new features for improved Web development. This document provides an overview of many of the new features that are included… >>> More
.NET Reflector 6, .NET Reflector Pro, TestDriven.NET, .NET 4.0 and Mono

as seen on Simple Talk - Search for 'Simple Talk'
By now you may well have noticed that .NET Reflector 6 and .NET Reflector Pro are out in the wild. The official launch happened today, although we actually put the software out last Thursday as part of a phased release plan to ensure that everything went smoothly today which, so far, it seems to have… >>> More
Redmine on Apache2 with Passenger issue

as seen on Server Fault - Search for 'Server Fault'
I installed Redmine and run it in Apache2 with the Passenger module. Apache2 boots, Passenger module gets loaded and the Redmine welcome page is shown, however when trying to login or navigate to other parts of the Redmine site, the browser keeps loading and loading and loading forever, although the… >>> More