Finding the most frequent subtrees in a collection of (parse) trees

Posted by peter.murray.rust on Stack Overflow
Published on 2009-11-06T03:30:38Z Indexed on 2010/05/18 4:00 UTC

Filed under: subtree, algorithm, tree

I have a collection of trees whose nodes are labelled, but not uniquely. Specifically, the trees come from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection; performance is not (yet) an issue. I'd be grateful for algorithms (ideally in Java) or pointers to tools that do this for treebanks. Note that the order of child nodes is significant.

EDIT @mjv. We are working in a limited domain (chemistry) which has a stylised language, so the variety of the trees is not huge - probably similar to children's readers. A simple tree for "the cat sat on the mat":

<sentence>
  <nounPhrase>
    <article/>
    <noun/>
  </nounPhrase>
  <verbPhrase>
    <verb/>
    <prepositionPhrase>
      <preposition/>
      <nounPhrase>
        <article/>
        <noun/>
      </nounPhrase>
    </prepositionPhrase>
  </verbPhrase>
</sentence>

Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat" and "mat" are not important in matching), so the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:

      <nounPhrase>
        <article/>
        <adjective/>
        <adjective/>
        <noun/>
      </nounPhrase>
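
To make the matching requirement concrete, here is a minimal sketch (not a tested solution) of one simple approach: serialize every node's complete subtree to a bracketed string that preserves child order, and count how often each string occurs. The `SubtreeCounter` and `Node` classes are hypothetical names introduced only for this illustration, and the sketch counts only complete subtrees (a node together with all of its descendants), not partial ones.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubtreeCounter {
    // Hypothetical node type: a label plus an ordered list of children.
    static class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Maps each bracketed subtree string to the number of times it was seen.
    final Map<String, Integer> counts = new HashMap<>();

    // Serializes the subtree rooted at n, e.g. "(nounPhrase(article)(noun))",
    // and records one occurrence for it and for every subtree below it.
    String add(Node n) {
        StringBuilder sb = new StringBuilder("(").append(n.label);
        for (Node child : n.children) {
            sb.append(add(child)); // child order is preserved
        }
        String key = sb.append(')').toString();
        counts.merge(key, 1, Integer::sum);
        return key;
    }

    public static void main(String[] args) {
        // "the cat sat on the mat", labels only (tokens are ignored).
        Node sentence = new Node("sentence",
            new Node("nounPhrase", new Node("article"), new Node("noun")),
            new Node("verbPhrase",
                new Node("verb"),
                new Node("prepositionPhrase",
                    new Node("preposition"),
                    new Node("nounPhrase", new Node("article"), new Node("noun")))));

        SubtreeCounter counter = new SubtreeCounter();
        counter.add(sentence);
        System.out.println(counter.counts.get("(nounPhrase(article)(noun))")); // prints 2
    }
}
```

Calling `add` once per tree on the same `SubtreeCounter` accumulates counts across the whole collection. The "big black cat" noun phrase gets a different key, `(nounPhrase(article)(adjective)(adjective)(noun))`, so it is counted separately.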

Real sentences will be longer: between 15 and 30 nodes. I would expect to get useful results from 1000 trees. If this takes no more than a day or so, that's acceptable.

Obviously, the smaller the subtree, the more frequent it is, so nounPhrase will be very common.
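
Because small subtrees dominate the counts, any ranking probably needs a size cut-off as well as a frequency cut-off. The sketch below assumes each subtree has already been counted under a bracketed key such as `(nounPhrase(article)(noun))`, where every node contributes exactly one `(`. The class name `FrequentSubtrees` and the `minNodes` and `minCount` thresholds are hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FrequentSubtrees {
    // Number of nodes in a bracketed subtree key: one "(" per node.
    static int size(String key) {
        int n = 0;
        for (int i = 0; i < key.length(); i++) {
            if (key.charAt(i) == '(') {
                n++;
            }
        }
        return n;
    }

    // Keeps subtrees with at least minNodes nodes seen at least minCount times,
    // most frequent first; ties are broken in favour of the larger subtree.
    static List<String> interesting(Map<String, Integer> counts, int minNodes, int minCount) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (size(e.getKey()) >= minNodes && e.getValue() >= minCount) {
                keys.add(e.getKey());
            }
        }
        keys.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : Integer.compare(size(b), size(a));
        });
        return keys;
    }
}
```

With `minNodes` set to 3, single-node entries such as `(noun)` drop out however often they occur, while `(nounPhrase(article)(noun))` is still reported.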

EDIT If this is to be solved by flattening the tree, then I think it would be related to Longest Common Substring, not Longest Common Subsequence. But note that I don't necessarily want just the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).
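
As a rough illustration of the flattening idea (a sketch under assumptions, not a full frequent-subtree miner): if each tree is flattened to a bracketed string that keeps child order, then a complete subtree appears as a contiguous substring, which is why the problem resembles a substring problem and not a subsequence problem. The class `FlattenedMatch` and the bracket encoding are assumptions made for this example, and labels are assumed not to contain parentheses.

```java
class FlattenedMatch {
    // Counts occurrences of a bracketed subtree pattern inside a flattened tree.
    // Because the pattern is balanced, each match is a complete subtree.
    static int occurrences(String flattenedTree, String pattern) {
        int count = 0;
        int from = flattenedTree.indexOf(pattern);
        while (from >= 0) {
            count++;
            from = flattenedTree.indexOf(pattern, from + 1);
        }
        return count;
    }

    public static void main(String[] args) {
        String np = "(nounPhrase(article)(noun))";
        // Flattened form of "the cat sat on the mat", labels only.
        String sentence = "(sentence" + np
            + "(verbPhrase(verb)(prepositionPhrase(preposition)" + np + ")))";
        System.out.println(occurrences(sentence, np)); // prints 2
    }
}
```

This only checks a pattern that is already known. Discovering all "interesting" patterns would still need something that enumerates candidates, for example every node's own flattened string.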
