Indexing and Searching Over Word Level Annotation Layers in Lucene

Posted by dmcer on Stack Overflow
Published on 2010-05-21T14:37:32Z; indexed on 2010/05/21 14:40 UTC

I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like "The man went to the store", the annotations might look like:


Word   POS  Chunk  NER
=====  ===  =====  ========
The    DT   NP     Person
man    NN   NP     Person
went   VBD  VP     -
to     TO   PP     -
the    DT   NP     Location
store  NN   NP     Location
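
To make the layout concrete, here is a minimal sketch in plain Java (not Lucene) of one way such rows could be held in memory: each token becomes a map from layer name to value. The class name `AnnotatedSentence` and the whitespace-separated input format are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
class AnnotatedSentence {
    // Layer names, in the column order of the example table.
    static final String[] LAYERS = {"Word", "POS", "Chunk", "NER"};

    // Parses whitespace-separated rows (one token per row) into layer -> value maps.
    static List<Map<String, String>> parse(String table) {
        List<Map<String, String>> tokens = new ArrayList<>();
        for (String row : table.split("\n")) {
            String[] cols = row.trim().split("\\s+");
            if (cols.length != LAYERS.length) {
                continue; // skip blank or malformed rows
            }
            Map<String, String> token = new LinkedHashMap<>();
            for (int i = 0; i < LAYERS.length; i++) {
                token.put(LAYERS[i], cols[i]);
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String table = "The DT NP Person\nman NN NP Person\nwent VBD VP -\n"
                + "to TO PP -\nthe DT NP Location\nstore NN NP Location";
        for (Map<String, String> token : parse(table)) {
            System.out.println(token);
        }
    }
}
```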

I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end-users might enter the query as follows:

Query: Word=Washington,NER=Person
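
As a rough illustration of what this notation is meant to express, the sketch below parses the comma-separated `Layer=value` pairs and treats them as constraints that must all hold on a single token. It is plain Java rather than Lucene code, and the class and method names (`LayerQuery`, `parse`, `documentMatches`) are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
class LayerQuery {
    // Builds a token from alternating layer/value arguments.
    static Map<String, String> token(String... layerValuePairs) {
        Map<String, String> t = new LinkedHashMap<>();
        for (int i = 0; i + 1 < layerValuePairs.length; i += 2) {
            t.put(layerValuePairs[i], layerValuePairs[i + 1]);
        }
        return t;
    }

    // Parses "Word=Washington,NER=Person" into {Word=Washington, NER=Person}.
    static Map<String, String> parse(String query) {
        Map<String, String> constraints = new LinkedHashMap<>();
        for (String part : query.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2 || kv[0].isEmpty()) {
                throw new IllegalArgumentException("Expected Layer=value, got: " + part);
            }
            constraints.put(kv[0].trim(), kv[1].trim());
        }
        return constraints;
    }

    // True if at least one token satisfies every constraint at once.
    static boolean documentMatches(List<Map<String, String>> tokens, String query) {
        Map<String, String> constraints = parse(query);
        for (Map<String, String> token : tokens) {
            boolean ok = true;
            for (Map.Entry<String, String> e : constraints.entrySet()) {
                if (!e.getValue().equals(token.get(e.getKey()))) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Map<String, String>> doc = Arrays.asList(
                token("Word", "Washington", "NER", "Person"),
                token("Word", "slept", "NER", "-"));
        System.out.println(documentMatches(doc, "Word=Washington,NER=Person"));
        System.out.println(documentMatches(doc, "Word=Washington,NER=Location"));
    }
}
```

Matching here is exact and case-sensitive; a real system would presumably normalize the word layer the same way at index and query time.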

I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where a word tagged Person is followed by the words "arrived at", followed by a word tagged Location. Such a query might look like:

Query: "NER=Person Word=arrived Word=at NER=Location"
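
The intended semantics of such a sequence query can be sketched as a simple scan: each space-separated step is a `Layer=value` constraint, and the steps must hold on adjacent tokens in order. This is plain Java for illustration, not Lucene; the names (`SequenceQuery`, `parseSteps`, `matches`) are hypothetical, and the surrounding quotes of the query are assumed to have been stripped already.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
class SequenceQuery {
    // Builds a token with just the two layers used by the example query.
    static Map<String, String> token(String word, String ner) {
        Map<String, String> t = new HashMap<>();
        t.put("Word", word);
        t.put("NER", ner);
        return t;
    }

    // Splits "NER=Person Word=arrived ..." into {layer, value} pairs, one per step.
    static List<String[]> parseSteps(String query) {
        List<String[]> steps = new ArrayList<>();
        for (String step : query.trim().split("\\s+")) {
            String[] kv = step.split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Expected Layer=value, got: " + step);
            }
            steps.add(kv);
        }
        return steps;
    }

    // True if some run of adjacent tokens satisfies the steps in order.
    static boolean matches(List<Map<String, String>> tokens, String query) {
        List<String[]> steps = parseSteps(query);
        for (int start = 0; start + steps.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < steps.size() && ok; i++) {
                String[] kv = steps.get(i);
                ok = kv[1].equals(tokens.get(start + i).get(kv[0]));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Map<String, String>> doc = new ArrayList<>();
        doc.add(token("Washington", "Person"));
        doc.add(token("arrived", "-"));
        doc.add(token("at", "-"));
        doc.add(token("Boston", "Location"));
        System.out.println(matches(doc, "NER=Person Word=arrived Word=at NER=Location"));
    }
}
```

A linear scan like this only defines what a match is; it does not scale to a large collection, which is why an index is needed.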

What's a good way to approach this with Lucene? Is there any way to index and search over document fields that contain structured tokens?
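
One way to frame the problem is as a positional index in which every annotation of a token is stored as its own term at the same position as the word, so that phrase-style matching works across layers. The sketch below emulates that idea in plain Java for a single document. It is not Lucene code, and the class `StackedPositionIndex` and the `Layer=value` term format are assumptions. The analogous Lucene mechanism would presumably be emitting the extra tokens with a position increment of 0, as synonym filters do, but that mapping is not verified here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical single-document positional index; not part of Lucene.
class StackedPositionIndex {
    // Term ("Layer=value") -> positions at which it occurs.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stores every layer value of one token at the same position.
    void addToken(int position, String... terms) {
        for (String term : terms) {
            Set<Integer> positions = postings.get(term);
            if (positions == null) {
                positions = new TreeSet<>();
                postings.put(term, positions);
            }
            positions.add(position);
        }
    }

    // True if terms[0], terms[1], ... occur at consecutive positions.
    boolean matchesPhrase(String... terms) {
        if (terms.length == 0) {
            return false;
        }
        Set<Integer> starts = postings.get(terms[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                Set<Integer> positions = postings.get(terms[i]);
                ok = positions != null && positions.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        StackedPositionIndex index = new StackedPositionIndex();
        index.addToken(0, "Word=Washington", "NER=Person");
        index.addToken(1, "Word=arrived", "NER=-");
        index.addToken(2, "Word=at", "NER=-");
        index.addToken(3, "Word=Boston", "NER=Location");
        System.out.println(
                index.matchesPhrase("NER=Person", "Word=arrived", "Word=at", "NER=Location"));
    }
}
```

Because all layers share one position space, a query can mix layers freely from step to step, which is the property the sequence query in the question relies on.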

Tags: lucene, data-mining, text-mining, nlp, java
