Indexing and Searching Over Word Level Annotation Layers in Lucene
Posted by dmcer on Stack Overflow
Published on 2010-05-21T14:37:32Z
Indexed on 2010/05/21 14:40 UTC
I have a data set with multiple layers of annotation over the underlying text, such as part-of-speech tags, chunks from a shallow parser, named entities, and others from various natural language processing (NLP) tools. For a sentence like "The man went to the store", the annotations might look like:
Word   POS  Chunk  NER
====   ===  =====  ========
The    DT   NP     Person
man    NN   NP     Person
went   VBD  VP     -
to     TO   PP     -
the    DT   NP     Location
store  NN   NP     Location
I'd like to index a bunch of documents with annotations like these using Lucene and then perform searches across the different layers. An example of a simple query would be to retrieve all documents where Washington is tagged as a person. While I'm not absolutely committed to the notation, syntactically end-users might enter the query as follows:
Query: Word=Washington,NER=Person
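To make the intended semantics of such a query concrete, here is a minimal pure-Python sketch. It is not Lucene code. It assumes each token is represented as a mapping from layer name to value, and that a query matches a document when a single token satisfies every Layer=Value constraint at once. The function names and the two sample documents are hypothetical, chosen only for illustration.

```python
def parse_constraints(query):
    """Parse 'Word=Washington,NER=Person' into {'Word': 'Washington', 'NER': 'Person'}."""
    constraints = {}
    for part in query.split(","):
        layer, _, value = part.strip().partition("=")
        constraints[layer] = value
    return constraints


def token_matches(token, constraints):
    """True if this one token carries every requested layer value."""
    return all(token.get(layer) == value for layer, value in constraints.items())


def matches_document(tokens, query):
    """True if any single token in the document satisfies the whole query."""
    constraints = parse_constraints(query)
    return any(token_matches(token, constraints) for token in tokens)


# Hypothetical sample documents: one dict per token, one key per annotation layer.
doc_person = [
    {"Word": "Washington", "POS": "NNP", "NER": "Person"},
    {"Word": "arrived", "POS": "VBD", "NER": "-"},
]
doc_place = [
    {"Word": "Washington", "POS": "NNP", "NER": "Location"},
    {"Word": "is", "POS": "VBZ", "NER": "-"},
]

print(matches_document(doc_person, "Word=Washington,NER=Person"))
print(matches_document(doc_place, "Word=Washington,NER=Person"))
```

The point of the sketch is that the constraints must hold on the same token: a document that mentions Washington as a location and some other person elsewhere should not match.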
I'd also like to do more complex queries involving the sequential order of annotations across different layers, e.g. find all the documents where a word tagged as a person is followed by the words "arrived at", followed by a word tagged as a location. Such a query might look like:
Query: "NER=Person Word=arrived Word=at NER=Location"
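The sequential query can be illustrated with a small positional inverted index, sketched below in plain Python rather than Lucene. The assumption is that every annotation on a token is indexed as its own "Layer=Value" term at that token's position, so terms from different layers share positions and a phrase-style match can mix layers freely. The names `build_index` and `search_sequence`, and the sample documents, are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each 'Layer=Value' term to the set of (doc_id, position) where it occurs."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            # All layers of one token are stored at the same position.
            for layer, value in token.items():
                index[f"{layer}={value}"].add((doc_id, pos))
    return index


def search_sequence(index, query):
    """Return the ids of documents in which the query terms occur at consecutive positions."""
    terms = query.strip('"').split()
    if not terms:
        return []
    hits = set()
    # Try every occurrence of the first term as a possible start of the sequence.
    for doc_id, start in index.get(terms[0], ()):
        if all((doc_id, start + i) in index.get(term, ()) for i, term in enumerate(terms)):
            hits.add(doc_id)
    return sorted(hits)


# Hypothetical sample documents: one dict per token, one key per annotation layer.
docs = {
    "d1": [
        {"Word": "Washington", "NER": "Person"},
        {"Word": "arrived", "NER": "-"},
        {"Word": "at", "NER": "-"},
        {"Word": "Boston", "NER": "Location"},
    ],
    "d2": [
        {"Word": "Washington", "NER": "Location"},
        {"Word": "arrived", "NER": "-"},
        {"Word": "at", "NER": "-"},
        {"Word": "noon", "NER": "-"},
    ],
}

index = build_index(docs)
print(search_sequence(index, '"NER=Person Word=arrived Word=at NER=Location"'))
```

Only "d1" satisfies the full sequence here. Lucene's token streams do allow several tokens to occupy the same position (a position increment of zero), so a similar layout may be achievable there, but whether that is the best fit for this problem is exactly what the question asks.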
What's a good way to approach this with Lucene? Is there any way to index and search over document fields that contain structured tokens?
© Stack Overflow or respective owner