Writing more efficient xquery code (avoiding redundant iteration)
- by Coquelicot
Here's a simplified version of a problem I'm working on: I have a bunch of xml data that encodes information about people. Each person is uniquely identified by an 'id' attribute, but they may go by many names. For example, in one document, I might find
<person id=1>Paul Mcartney</person>
<person id=2>Ringo Starr</person>
And in another I might find:
<person id=1>Sir Paul McCartney</person>
<person id=2>Richard Starkey</person>
I want to use xquery to produce a new document that lists every name associated with a given id. i.e.:
<person id=1>
<name>Paul McCartney</name>
<name>Sir Paul McCartney</name>
<name>James Paul McCartney</name>
</person>
<person id=2>
...
</person>
The way I'm doing this now in xquery is something like this (pseudocode-esque):
let $ids := distinct-terms( [all the id attributes on people] )
for $id in $ids
return <person id={$id}>
{
for $unique-name in distinct-values
(
for $name in ( [all names] )
where $name/@id=$id
return $name
)
return <name>{$unique-name}</name>
}
</person>
The problem is that this is really slow. I imagine the bottleneck is the innermost loop, which executes once for every id (of which there are about 1200). I'm dealing with a fair bit of data (300 MB, spread over about 800 xml files), so even a single execution of the query in the inner loop takes about 12 seconds, which means that repeating it 1200 times will take about 4 hours (which might be optimistic - the process has been running for 3 hours so far). Not only is it slow, it's using a whole lot of virtual memory. I'm using Saxon, and I had to set java's maximum heap size to 10 GB (!) to avoid getting out of memory errors, and it's currently using 6 GB of physical memory.
So here's how I'd really like to do this (in Pythonic pseudocode):
persons = {}
for id in ids:
person[id] = set()
for person in all_the_people_in_my_xml_document:
persons[person.id].add(person.name)
There, I just did it in linear time, with only one sweep of the xml document. Now, is there some way to do something similar in xquery? Surely if I can imagine it, a reasonable programming language should be able to do it (he said quixotically). The problem, I suppose, is that unlike Python, xquery doesn't (as far as I know) have anything like an associative array.
Is there some clever way around this? Failing that, is there something better than xquery that I might use to accomplish my goal? Because really, the computational resources I'm throwing at this relatively simple problem are kind of ridiculous.