Indexing and Searching Over Word Level Annotation Layers in Lucene

Posted by dmcer on Stack Overflow
Published on 2010-05-21T14:37:32Z

I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like "The man went to the store", the annotations might look like:


Word   POS  Chunk  NER
=====  ===  =====  ========
The    DT   NP     Person
man    NN   NP     Person
went   VBD  VP     -
to     TO   PP     -
the    DT   NP     Location
store  NN   NP     Location
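
To make the data model concrete, here is a minimal Python sketch (not Lucene code) that reads rows like the ones above and emits one "Layer=value" term per annotation layer, with all terms for a word sharing that word's position. The names Token, parse_rows and stacked_terms are hypothetical, and dropping the "-" placeholder for untagged words is an assumption. In Lucene terms this roughly corresponds to emitting several tokens at the same position (a position increment of zero), though the sketch does not use any Lucene API.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One word with its annotation layers (hypothetical structure)."""
    word: str
    pos: str
    chunk: str
    ner: str


ROWS = """\
The    DT   NP   Person
man    NN   NP   Person
went   VBD  VP   -
to     TO   PP   -
the    DT   NP   Location
store  NN   NP   Location
"""


def parse_rows(text):
    """Turn whitespace-separated rows into Token records."""
    return [Token(*line.split()) for line in text.splitlines() if line.strip()]


def stacked_terms(tokens):
    """Yield (position, term) pairs; every layer of a word shares its position."""
    for position, tok in enumerate(tokens):
        yield position, f"Word={tok.word}"
        yield position, f"POS={tok.pos}"
        yield position, f"Chunk={tok.chunk}"
        if tok.ner != "-":  # assumption: "-" means "no entity", so index nothing
            yield position, f"NER={tok.ner}"


tokens = parse_rows(ROWS)
terms = list(stacked_terms(tokens))
for position, term in terms:
    print(position, term)
```

Prefixing each term with its layer name keeps the layers apart inside a single field, so a word "Person" and a tag "NER=Person" cannot be confused.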

I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end-users might enter the query as follows:

Query: Word=Washington,NER=Person
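
The intended semantics of that query can be pinned down with a small Python sketch: both constraints must hold for the same token, not merely somewhere in the same document. This is a toy positional index rather than Lucene itself; build_index, search_same_token and the two sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> doc_id -> set of token positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, tokens in docs.items():
        for position, layers in enumerate(tokens):
            for layer, value in layers.items():
                index[f"{layer}={value}"][doc_id].add(position)
    return index


def search_same_token(index, query):
    """Return doc ids where every comma-separated term matches one shared position."""
    terms = [t.strip() for t in query.split(",")]
    postings = [index.get(t, {}) for t in terms]
    # Documents that contain every term somewhere.
    candidates = set(postings[0]).intersection(*(set(p) for p in postings[1:]))
    hits = []
    for doc_id in candidates:
        # Positions at which all terms occur together.
        shared = set.intersection(*(set(p[doc_id]) for p in postings))
        if shared:
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample data: one dict of layers per token.
docs = {
    "d1": [{"Word": "Washington", "NER": "Person"}, {"Word": "arrived"}],
    "d2": [{"Word": "Washington", "NER": "Location"},
           {"Word": "Lincoln", "NER": "Person"}],
}
index = build_index(docs)
result = search_same_token(index, "Word=Washington,NER=Person")
print(result)
```

Document d2 contains both terms but on different tokens, which is why a plain Boolean AND over the two terms would not be enough here.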

I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where there's a word tagged person followed by the words arrived at followed by a word tagged location. Such a query might look like:

Query: "NER=Person Word=arrived Word=at NER=Location"
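
As a sketch of what such a sequence query would have to compute, the following Python example treats the quoted string as a phrase whose slots may come from any layer: slot i has to match at position start + i. It assumes the same toy positional index as a stand-in for a real one. The names build_index, search_sequence and tok, along with the sample documents, are hypothetical, and no Lucene API is involved.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> doc_id -> set of token positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, tokens in docs.items():
        for position, layers in enumerate(tokens):
            for layer, value in layers.items():
                index[f"{layer}={value}"][doc_id].add(position)
    return index


def search_sequence(index, query):
    """Return doc ids where the query terms occur at consecutive positions."""
    terms = query.strip('"').split()
    postings = [index.get(t, {}) for t in terms]
    hits = []
    for doc_id in sorted(postings[0]):
        for start in postings[0][doc_id]:
            # Slot i of the query must match the token at start + i.
            if all(start + offset in p.get(doc_id, ())
                   for offset, p in enumerate(postings)):
                hits.append(doc_id)
                break
    return hits


def tok(word, ner=None):
    """Build the layer dict for one token (hypothetical helper)."""
    layers = {"Word": word}
    if ner:
        layers["NER"] = ner
    return layers


docs = {
    "a": [tok("Washington", "Person"), tok("arrived"), tok("at"),
          tok("Boston", "Location")],
    "b": [tok("Boston", "Location"), tok("arrived"), tok("at"),
          tok("Washington", "Person")],
}
index = build_index(docs)
result = search_sequence(index, '"NER=Person Word=arrived Word=at NER=Location"')
print(result)
```

Because all layers share one position space, a phrase-style match can mix word and tag constraints freely. Matching a multi-word entity as a single slot would need extra handling that this sketch leaves out.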

What's a good way to approach this with Lucene? Is there any way to index and search over document fields that contain structured tokens?

© Stack Overflow or respective owner
