Converting a treebank of vertical trees to S-expressions
- by Andreas
I need to preprocess a treebank: a corpus of sentences with their parse trees. The input format is a vertical representation of the trees, where the number of leading = signs gives the depth of each node, like so:
S
=NP
==(DT +def) the
== (N +ani) man
=VP
==V walks
...and I need it as a bracketed S-expression, without the feature annotations, like this:
(S (NP (DT the) (N man)) (VP (V walks)))
I have code that almost does it, but not quite: the output always ends up with a missing parenthesis somewhere. Should I use a proper parser, maybe one based on a CFG? The current code is at http://github.com/andreasvc/eodop/blob/master/arbobanko.py
The code also contains real examples from the treebank.
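For reference, the conversion described above can probably be done with a stack of open depths instead of a full parser. The following is only a minimal sketch, not what arbobanko.py does. It assumes that depth is the number of leading = signs, that a line with a word after the tag is a leaf, that features such as +def inside parentheses can be dropped, and that each call receives the lines of exactly one tree. The names LINE and vertical_to_sexpr are made up for illustration.

```python
import re

# Assumed line shape: leading '=' signs (depth), optional whitespace, then
# either "(TAG features...)" or a bare TAG, then an optional word (leaves only).
LINE = re.compile(r'^(=*)\s*(?:\(\s*(\S+)[^)]*\)|(\S+))\s*(.*)$')


def vertical_to_sexpr(lines):
    """Convert the lines of one vertical tree into a bracketed string."""
    out = []
    open_depths = []  # depths of the nodes whose '(' has not been closed yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = LINE.match(line)
        depth = len(m.group(1))
        tag = m.group(2) or m.group(3)
        word = m.group(4).strip()
        # Close every open node that is not an ancestor of the current line.
        while open_depths and open_depths[-1] >= depth:
            open_depths.pop()
            out.append(')')
        if word:
            # Leaf: open and close in one step, so it never goes on the stack.
            out.append(' (%s %s)' % (tag, word))
        else:
            # Internal node: leave it open until a shallower or equal depth appears.
            out.append(' (%s' % tag)
            open_depths.append(depth)
    # Close whatever is still open at the end of the tree.
    out.append(')' * len(open_depths))
    return ''.join(out).strip()


example = """S
=NP
==(DT +def) the
== (N +ani) man
=VP
==V walks"""

print(vertical_to_sexpr(example.splitlines()))
```

The point of the stack is that closing parentheses are emitted only in two places: when a new line is at the same depth as or shallower than an open node, and once at the end of the tree. That makes it harder to lose one. Real treebank lines may have shapes this regular expression does not cover, so it would need checking against the actual data.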