Indexing and Searching Over Word Level Annotation Layers in Lucene
- by dmcer
I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like "The man went to the store", the annotations might look like:
Word   POS  Chunk  NER
=====  ===  =====  ========
The    DT   NP     Person
man    NN   NP     Person
went   VBD  VP     -
to     TO   PP     -
the    DT   NP     Location
store  NN   NP     Location
I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where "Washington" is tagged as a person. While I'm not absolutely committed to the notation, end-users might enter the query as follows:
Query: Word=Washington,NER=Person
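To make the intended semantics of this notation concrete, here is a minimal Python sketch, independent of Lucene. The helper names (parse_constraints, doc_matches) and the one-dictionary-per-token representation are assumptions made up for illustration; they are not part of any Lucene API. The idea is that every comma-separated constraint must hold on the same token.

```python
def parse_constraints(query):
    """Turn 'Word=Washington,NER=Person' into {'Word': 'Washington', 'NER': 'Person'}."""
    constraints = {}
    for part in query.split(","):
        layer, _, value = part.strip().partition("=")
        constraints[layer] = value
    return constraints


def doc_matches(tokens, query):
    """Return True if at least one token satisfies every layer constraint."""
    constraints = parse_constraints(query)
    return any(
        all(token.get(layer) == value for layer, value in constraints.items())
        for token in tokens
    )


# Hypothetical annotated document: one dict per word, one key per layer.
doc = [
    {"Word": "Washington", "POS": "NNP", "Chunk": "NP", "NER": "Person"},
    {"Word": "arrived", "POS": "VBD", "Chunk": "VP", "NER": "-"},
]
```

With this document, doc_matches(doc, "Word=Washington,NER=Person") holds, while the same word with NER=Location does not match.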
I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where there's a word tagged as a person, followed by the words "arrived at", followed by a word tagged as a location. Such a query might look like:
Query: "NER=Person Word=arrived Word=at NER=Location"
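As a rough illustration of what this sequential query should mean, the following Python sketch (again not Lucene code) treats each space-separated item as a constraint on one token and looks for a run of adjacent tokens that satisfies them in order. The function names and the token representation are hypothetical, and the sketch assumes each item consumes exactly one word, so a multi-word entity would need one item per word.

```python
def parse_sequence(query):
    """Turn '"NER=Person Word=arrived"' into a list of per-token constraint dicts."""
    steps = []
    for item in query.strip().strip('"').split():
        step = {}
        for part in item.split(","):
            layer, _, value = part.partition("=")
            step[layer] = value
        steps.append(step)
    return steps


def sequence_matches(tokens, query):
    """Return True if some run of adjacent tokens satisfies the steps in order."""
    steps = parse_sequence(query)
    if not steps:
        return False
    n = len(steps)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(
            all(tok.get(layer) == value for layer, value in step.items())
            for tok, step in zip(window, steps)
        ):
            return True
    return False


# Hypothetical annotated document: one dict per word, one key per layer.
doc = [
    {"Word": "Washington", "NER": "Person"},
    {"Word": "arrived", "NER": "-"},
    {"Word": "at", "NER": "-"},
    {"Word": "Boston", "NER": "Location"},
]
```

Here sequence_matches(doc, '"NER=Person Word=arrived Word=at NER=Location"') holds, whereas swapping the Person and Location items does not match.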
What's a good way to approach this with Lucene? Is there any way to index and search over document fields that contain structured tokens?