Parsing xml file that comes in as one object per line

Posted by Casey on Stack Overflow See other posts from Stack Overflow or by Casey
Published on 2014-08-08T13:28:51Z Indexed on 2014/08/22 16:22 UTC
Read the original article Hit count: 461

Filed under:

I haven't been here in so long, I forgot my prior account! Anyways, I am working on parsing an xml document that comes in ugly. It is for banking statements. Each line is a <statement>all tags</statement>. Now, what I need to do is read this file in, and parse the XML document at the same time, while formatting it more human readable too. Point beeing,

Original input looks like this:

<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>

I need the final output to be as follows:

<statement>
    <name></name>
    <address></address>
</statement>

This is fine and dandy. I am using the following "very slow considering 5.1 million lines, 254k data file, and about 60k statements takes around 8 minutes".

foreach(String item in lines)
{
    XElement xElement = XElement.Parse(item);
    sr.WriteLine(xElement.ToString().Trim());
}

Then when the file is formatted this is what sucks. I need to check every single tag in transaction elements, and if a tag is missing that could be there, I have to fill it in. Our designer software will default prior values in if a tag is possible, and the current objects does not have. It defaults in the value of a prior one that was not Null. "I know, and they swear up and down it is not a bug... ok?"

So, that is also taking about 5 to 10 minutes. I need to break all this down, and find a faster method for working with the initial XML. This is a preprocess action, and cannot take that long if not necessary. It just seems redundant.

Is there a better way to parse the XML, or is this the best I can do? I parse the XML, write to a temp file, and then read that file in, to the output file inserting the missing tags. 2 IO runs for one process. Yuck.

Developer IT

Parsing xml file that comes in as one object per line - Developer IT

Parsing xml file that comes in as one object per line

c#

.NET

Xml

io

Related posts about c#

.NET WebRequest.PreAuthenticate not quite what it sounds like

HttpWebRequest and Ignoring SSL Certificate Errors

The dynamic Type in C# Simplifies COM Member Access from Visual FoxPro

Dynamic Type to do away with Reflection

Finding a Relative Path in .NET

Related posts about .NET

Apt-Get Update: failure to fetch; can't connect to any sources

12.04: Apt-Get Update: failure to fetch; can't connect to any sources

What's New in ASP.NET 4

.NET Reflector 6, .NET Reflector Pro, TestDriven.NET, .NET 4.0 and Mono

Redmine on Apache2 with Passenger issue

Categories cloud