XML: Process large data
- by Atmocreations
Hello
What XML-parser do you recommend for the following purpose:
The XML-file (formatted, containing whitespaces) is around 800 MB. It mostly contains three types of tag (let's call them n, w and r).
They have an attribute called id which i'd have to search for, as fast as possible.
Removing attributes I don't need could save around 30%, maybe a bit more.
First part for optimizing the second part: Is there any good tool (command line linux and windows if possible) to easily remove unused attributes in certain tags? I know that XSLT could be used. Or are there any easy alternatives? Also, I could split it into three files, one for each tag to gain speed for later parsing...
Speed is not too important for this preparation of the data, of course it would be nice when it took rather minutes than hours.
Second part: Once I have the data prepared, be it shortened or not, I should be able to search for the ID-attribute I was mentioning, this being time-critical.
Estimations using wc -l tell me that there are around 3M N-tags and around 418K W-tags. The latter ones can contain up to approximately 20 subtags each. W-Tags also contain some, but they would be stripped away.
"All I have to do" is navigating between tags containing certain id-attributes. Some tags have references to other id's, therefore giving me a tree, maybe even a graph. The original data is big (as mentioned), but the resultset shouldn't be too big as I only have to pick out certain elements.
Now the question: What XML parsing library should I use for this kind of processing? I would use Java 6 in a first instance, with having in mind to be porting it to BlackBerry.
Might it be useful to just create a flat file indexing the id's and pointing to an offset in the file? Is it even necessary to do the optimizations mentioned in the upper part? Or are there parser known to be quite as fast with the original data?
Little note: To test, I took the id being on the very last line on the file and searching for the id using grep. This took around a minute on a Core 2 Duo.
What happens if the file grows even bigger, let's say 5 GB?
I appreciate any notice or recommendation.
Thank you all very much in advance and regards