Writing more efficient xquery code (avoiding redundant iteration)

Posted by Coquelicot on Stack Overflow See other posts from Stack Overflow or by Coquelicot
Published on 2010-05-13T01:50:39Z Indexed on 2010/05/13 1:54 UTC
Read the original article Hit count: 398

Filed under:
|

Here's a simplified version of a problem I'm working on: I have a bunch of xml data that encodes information about people. Each person is uniquely identified by an 'id' attribute, but they may go by many names. For example, in one document, I might find

<person id=1>Paul Mcartney</person>
<person id=2>Ringo Starr</person>

And in another I might find:

<person id=1>Sir Paul McCartney</person>
<person id=2>Richard Starkey</person>

I want to use xquery to produce a new document that lists every name associated with a given id. i.e.:

<person id=1>
    <name>Paul McCartney</name>
    <name>Sir Paul McCartney</name>
    <name>James Paul McCartney</name>
</person>
<person id=2>
    ...
</person>

The way I'm doing this now in xquery is something like this (pseudocode-esque):

let $ids := distinct-terms( [all the id attributes on people] )
for $id in $ids
    return <person id={$id}>
    {
    for $unique-name in distinct-values
            (
            for $name in ( [all names] )
            where $name/@id=$id
            return $name
            )
        return <name>{$unique-name}</name>
    }
    </person>

The problem is that this is really slow. I imagine the bottleneck is the innermost loop, which executes once for every id (of which there are about 1200). I'm dealing with a fair bit of data (300 MB, spread over about 800 xml files), so even a single execution of the query in the inner loop takes about 12 seconds, which means that repeating it 1200 times will take about 4 hours (which might be optimistic - the process has been running for 3 hours so far). Not only is it slow, it's using a whole lot of virtual memory. I'm using Saxon, and I had to set java's maximum heap size to 10 GB (!) to avoid getting out of memory errors, and it's currently using 6 GB of physical memory.

So here's how I'd really like to do this (in Pythonic pseudocode):

persons = {}
for id in ids:
    person[id] = set()
for person in all_the_people_in_my_xml_document:
    persons[person.id].add(person.name)

There, I just did it in linear time, with only one sweep of the xml document. Now, is there some way to do something similar in xquery? Surely if I can imagine it, a reasonable programming language should be able to do it (he said quixotically). The problem, I suppose, is that unlike Python, xquery doesn't (as far as I know) have anything like an associative array.

Is there some clever way around this? Failing that, is there something better than xquery that I might use to accomplish my goal? Because really, the computational resources I'm throwing at this relatively simple problem are kind of ridiculous.

© Stack Overflow or respective owner

Related posts about Xml

Related posts about xquery