NLP - Queries using semantic wildcards in full text searching, maybe with Lucene?

Posted by Zsolt on Programmers
Published on 2012-11-25T23:56:21Z Indexed on 2012/11/26 5:19 UTC

Let's say I have a big corpus (in English, for example, or in an arbitrary language), and I want to perform semantic search on it. For example, I have the query:

"Be careful: [art] armada of [sg] is coming to [do sg]!"

And the corpus contains the following sentence:

"Be careful: an armada of alien ships is coming to destroy our planet!"

It can be seen that my query string could contain "semantic placeholders", such as:

[art] - a placeholder for articles (for example a / an in English)
[sg], [do sg] - placeholders for NPs and VPs (subjects and predicates)

I would like to develop a library capable of handling these queries efficiently. I suspect that some kind of POS tagging would be necessary for parsing the text, but I don't want to reimplement an existing full-text search engine just to make this work, so I'm wondering how I could integrate this behaviour into a search engine like Lucene.
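To make the intended matching concrete, here is a minimal sketch of how the placeholders could be resolved against a POS-tagged sentence. It assumes Penn Treebank style tags, a hand-tagged sentence instead of a real tagger, and a deliberately tiny noun-phrase pattern; the names `PLACEHOLDERS`, `encode`, `compile_query` and `match` are hypothetical and not part of any library.

```python
import re

# Noun phrase as a pattern over "word/TAG " tokens (Penn Treebank tags):
# optional determiner or possessive, any adjectives, one or more nouns.
NP = r"(?:\S+/(?:DT|PRP\$) )?(?:\S+/JJ )*(?:\S+/NNS? )+"

# Hypothetical placeholder definitions; a real grammar would be much richer.
PLACEHOLDERS = {
    "art": r"\S+/DT ",
    "sg": NP,
    "do sg": r"\S+/VB " + NP,
}

def encode(tagged):
    """Flatten (word, tag) pairs into a 'word/TAG word/TAG ' string."""
    return "".join(f"{word}/{tag} " for word, tag in tagged)

def compile_query(query):
    """Translate a placeholder query into a regex over the encoded string."""
    names, parts = [], [r"(?<!\S)"]  # only start matching at a token boundary
    for piece in re.findall(r"\[[^\]]+\]|\w+|[^\w\s]", query):
        if piece.startswith("[") and piece.endswith("]") and len(piece) > 2:
            name = piece[1:-1]
            names.append(name)
            parts.append("(" + PLACEHOLDERS[name] + ")")
        else:
            parts.append(re.escape(piece) + r"/\S+ ")  # literal token, any tag
    return names, re.compile("".join(parts))

def match(query, tagged):
    """Return [(placeholder, matched words), ...] or None if nothing matches."""
    names, pattern = compile_query(query)
    found = pattern.search(encode(tagged))
    if found is None:
        return None
    def words(span):
        # Drop the "/TAG" suffix from every token in the matched span.
        return " ".join(tok.rsplit("/", 1)[0] for tok in span.split())
    return [(name, words(group)) for name, group in zip(names, found.groups())]

# Hand-tagged sentence (assumption); in practice a POS tagger would supply the tags.
SENTENCE = [
    ("Be", "VB"), ("careful", "JJ"), (":", ":"), ("an", "DT"),
    ("armada", "NN"), ("of", "IN"), ("alien", "JJ"), ("ships", "NNS"),
    ("is", "VBZ"), ("coming", "VBG"), ("to", "TO"), ("destroy", "VB"),
    ("our", "PRP$"), ("planet", "NN"), ("!", "."),
]
QUERY = "Be careful: [art] armada of [sg] is coming to [do sg]!"

print(match(QUERY, SENTENCE))
# [('art', 'an'), ('sg', 'alien ships'), ('do sg', 'destroy our planet')]
```

This scans one sentence at a time, so on its own it says nothing about doing the same thing efficiently over a whole corpus.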

I know there are SpanQueries, which could behave similarly in some cases, but as far as I can see, Lucene doesn't do any semantic processing of the stored text.
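One direction that might bridge the gap is to index each POS tag as an extra term at the same position as its word, the way stacked synonyms share a position, so that a positional query can mix literal words and tag terms. The toy index below only illustrates that idea in plain Python; it is not Lucene code, the `_pos_` prefix and the function names are hypothetical, and it covers only single-token placeholders such as [art].

```python
from collections import defaultdict

def build_index(docs):
    """Map term -> {doc_id: set of positions}; a word and its tag share a position."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, tagged in docs.items():
        for pos, (word, tag) in enumerate(tagged):
            index[word.lower()][doc_id].add(pos)
            index["_pos_" + tag][doc_id].add(pos)  # stacked tag term (hypothetical prefix)
    return index

def phrase_search(index, terms):
    """Return sorted ids of documents containing the terms at consecutive positions."""
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # Every following term must occur exactly i positions after the start.
            if all(start + i in index.get(term, {}).get(doc_id, ())
                   for i, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hand-tagged toy documents (assumption); a POS tagger would supply the tags.
DOCS = {
    1: [("an", "DT"), ("armada", "NN"), ("of", "IN"), ("ships", "NNS")],
    2: [("the", "DT"), ("armada", "NN"), ("sailed", "VBD")],
    3: [("armada", "NN"), ("of", "IN"), ("ships", "NNS")],
}
INDEX = build_index(DOCS)

# "[art] armada of" becomes a phrase of one tag term and two literal words.
print(phrase_search(INDEX, ["_pos_DT", "armada", "of"]))  # [1]
```

Multi-word placeholders such as [sg] would need something more, for example chunk labels indexed with their span lengths, which this sketch does not attempt.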

Is it possible to implement behaviour like this? Or do I have to write my own search engine?

© Programmers or respective owner
