XmlWriter and lower ASCII characters

Posted by Rick Strahl on West-Wind See other posts from West-Wind or by Rick Strahl
Published on Mon, 02 Jan 2012 10:36:44 GMT Indexed on 2012/03/18 17:58 UTC
Read the original article Hit count: 321

Filed under:
|

Ran into an interesting problem today on my CodePaste.net site: The main RSS and ATOM feeds on the site were broken because one code snippet on the site contained a lower ASCII character (CHR(3)). I don't think this was done on purpose but it was enough to make the feeds fail.

After quite a bit of debugging and throwing in a custom error handler into my actual feed generation code that just spit out the raw error instead of running it through the ASP.NET MVC and my own error pipeline I found the actual error.

The lovely base exception and error trace I got looked like this:

Error: '', hexadecimal value 0x03, is an invalid character.


at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
at System.Xml.XmlWellFormedWriter.WriteString(String text)
at System.Xml.XmlWriter.WriteElementString(String localName, String ns, String value)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteItemContents(XmlWriter writer, SyndicationItem item, Uri feedBaseUri)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteItem(XmlWriter writer, SyndicationItem item, Uri feedBaseUri)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteItems(XmlWriter writer, IEnumerable`1 items, Uri feedBaseUri)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteFeed(XmlWriter writer)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteTo(XmlWriter writer)
at CodePasteMvc.Controllers.ApiControllerBase.GetFeed(Object instance) in C:\Projects2010\CodePaste\CodePasteMvc\Controllers\ApiControllerBase.cs:line 131

XML doesn't like extended ASCII Characters

It turns out the issue is that XML in general does not deal well with lower ASCII characters. According to the XML spec it looks like any characters below 0x09 are invalid. If you generate an XML document in .NET with an embedded  entity (as mine did to create the error above), you tend to get an XML document error when displaying it in a viewer. For example, here's what the result of my  feed output looks like with the invalid character embedded inside of Chrome which displays RSS feeds as raw XML by default:

ChromeError

Other browsers show similar error messages. The nice thing about Chrome is that you can actually view source and jump down to see the line that causes the error which allowed me to track down the actual message that failed.

If you create an XML document that contains a 0x03 character the XML writer fails outright with the error:

'', hexadecimal value 0x03, is an invalid character.

The good news is that this behavior is overridable so XML output can at least be created by using the XmlSettings object when configuring the XmlWriter instance. In my RSS configuration code this looks something like this:

MemoryStream ms = new MemoryStream();
var settings = new XmlWriterSettings()
{
    CheckCharacters = false
};
XmlWriter writer = XmlWriter.Create(ms,settings);

and voila the feed now generates.

Now generally this is probably NOT a good idea, because as mentioned above these characters are illegal and if you view a raw XML document you'll get validation errors. Luckily though most RSS feed readers however don't care and happily accept and display the feed correctly, which is good because it got me over an embarrassing hump until I figured out a better solution.

How to handle extended Characters?

I was glad to get the feed fixed for the time being, but now I was still stuck with an interesting dilemma. CodePaste.net accepts user input for code snippets and those code snippets can contain just about anything. This means that ASP.NET's standard request filtering cannot be applied to this content. The code content displayed is encoded before display so for the HTML end the CHR(3) input is not really an issue.

While invisible characters are hardly useful in user input it's not uncommon that odd characters show up in code snippets. You know the old fat fingering that happens when you're in the middle of a coding session and those invisible characters do end up sometimes in code editors and then end up pasted into the HTML textbox for pasting as a Codepaste.net snippet.

The question is how to filter this text? Looking back at the XML Charset Spec it looks like all characters below 0x20 (space) except for 0x09 (tab), 0x0A (LF), 0x0D (CR) are illegal. So applying the following filter with a RegEx should work to remove invalid characters:

string code = Regex.Replace(item.Code, @"[\u0000-\u0008,\u000B,\u000C,\u000E-\u001F]", "");

Applying this RegEx to the code snippet (and title) eliminates the problems and the feed renders cleanly.

© Rick Strahl, West Wind Technologies, 2005-2012
Posted in .NET  XML  

© West-Wind or respective owner

Related posts about .NET

Related posts about Xml