How to use Linux csplit to chop up a massive XML file?

Posted by Fred on Stack Overflow, 2010-05-13

Hi everyone, I have a gigantic (4GB) XML file that I am currently breaking into chunks with the Linux "split" command (every 25,000 lines, not by bytes). This usually works great (I end up with about 50 files), except that some of the data descriptions contain line breaks, so the chunk files frequently do not have the proper closing tags, and my parser chokes halfway through processing.
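
For reference, my current split invocation looks roughly like this (a reconstructed sketch; the chunk_ output prefix is just a placeholder name):

split -l 25000 myfile.xml chunk_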

Example file (note: normally each "listing" XML node is supposed to be on its own line):

<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks 
that screw the split function</desc><more_tags>stuff</more_tags></listing>
</listings>

Then sometimes my split output ends up like this:

<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks ... 
EOF

So I have been reading about "csplit", and it sounds like it might solve this issue, but I can't seem to get the regular expression right.

Basically I want the same output of roughly 50 files.

Something like:

csplit -k myfile.xml '/</listing>/' 25000 {50}

Any help would be great. Thanks!
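
In case it clarifies what I'm fumbling with: as far as I can tell (this is just my guess, assuming GNU csplit, and quoting the braces only to keep the shell away from them), something like the following is at least valid syntax, but it splits at every single <listing> line rather than every 25,000 lines:

csplit -k myfile.xml '/<listing>/' '{*}'

which would give me one file per listing instead of ~50 files, so it isn't quite what I'm after.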
