Lucene - querying with long strings

Posted by Mikos on Stack Overflow
Published on 2010-03-23T21:20:57Z Indexed on 2010/03/23 21:23 UTC

I have an index with a field "Affiliation". Some example values are:

  • "Stanford University School of Medicine, Palo Alto, CA USA",
  • "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA",
  • "School of Medicine, Harvard University, Boston MA",
  • "Brigham & Women's, Harvard University School of Medicine, Boston, MA"
  • "Harvard University, Cambridge MA"

and so on. The bottom line is that the affiliations are written in multiple ways with no apparent consistency.

When I query the index on the affiliation field using, say, "School of Medicine, Stanford University, Palo Alto, CA" (with QueryParser) to find all Stanford-related documents, I get a lot of false positives, presumably because terms such as "School of Medicine" also appear in non-Stanford affiliations. (Note: I cannot use a PhraseQuery because of the variability in the way the affiliation is written.)
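To illustrate the false positives, here is a minimal plain-Java sketch, not Lucene code. It assumes the query terms are combined with OR, which is QueryParser's default operator, and it uses a hypothetical `tokenize` helper (lowercase, split on non-letters) as a stand-in for whatever analyzer the index really uses. Under those assumptions, every example affiliation shares at least one term with the query:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class OrMatchSketch {
    // Hypothetical stand-in for an analyzer: lowercase, split on anything that is not a letter.
    static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // OR semantics: a document matches if it shares at least one term with the query.
    static boolean matchesAnyTerm(String query, String document) {
        Set<String> shared = tokenize(document);
        shared.retainAll(tokenize(query));
        return !shared.isEmpty();
    }

    // Returns every document that matches the query under OR semantics.
    static List<String> search(String query, List<String> documents) {
        List<String> hits = new ArrayList<>();
        for (String doc : documents) {
            if (matchesAnyTerm(query, doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> affiliations = Arrays.asList(
            "Stanford University School of Medicine, Palo Alto, CA USA",
            "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA",
            "School of Medicine, Harvard University, Boston MA",
            "Brigham & Women's, Harvard University School of Medicine, Boston, MA",
            "Harvard University, Cambridge MA");
        String query = "School of Medicine, Stanford University, Palo Alto, CA";
        // The Harvard entries are returned too, because they share "university",
        // "school", "of" or "medicine" with the query.
        for (String hit : search(query, affiliations)) {
            System.out.println(hit);
        }
    }
}
```

The sketch ignores scoring entirely; in a real index the Stanford entries would likely rank higher, but the Harvard entries would still be in the result set.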

I have tried the following:

  1. Used a SpanNearQuery built by splitting the search phrase on whitespace (here I get no results!)

  2. Tried boosting (using ^) by splitting on the commas and giving the last parts, such as "Palo Alto CA", a much higher boost than the initial phrases. Here I still get lots of false positives.

Any suggestions on how to approach this? If SpanNearQuery is the way to go, any ideas on why I get 0 results?
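One possible cause of the empty SpanNearQuery result, sketched in plain Java rather than against the Lucene API: terms passed to SpanTermQuery are not run through an analyzer, so raw whitespace-split tokens such as "Medicine," or "CA" may not equal any indexed term. The sketch assumes the field was indexed with an analyzer that lowercases and strips punctuation (the post does not say which analyzer was used), and `analyze` and `missingTerms` are hypothetical helpers. Since a span-near match needs every clause to match, a single missing term is enough to return nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class UnanalyzedTermsSketch {
    // Assumed index-time analysis: lowercase, split on anything that is not a letter.
    static Set<String> analyze(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Lists the query terms that have no exact counterpart among the indexed terms.
    static List<String> missingTerms(Iterable<String> queryTerms, Set<String> indexedTerms) {
        List<String> missing = new ArrayList<>();
        for (String term : queryTerms) {
            if (!indexedTerms.contains(term)) {
                missing.add(term);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String fieldValue =
            "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA";
        String phrase = "School of Medicine, Stanford University, Palo Alto, CA";
        Set<String> indexed = analyze(fieldValue);

        // Raw whitespace split keeps the capitals and the trailing commas.
        List<String> rawTerms = Arrays.asList(phrase.split("\\s+"));
        System.out.println("Missing (raw split): " + missingTerms(rawTerms, indexed));

        // Running the phrase through the same analysis makes every term line up.
        System.out.println("Missing (analyzed):  " + missingTerms(analyze(phrase), indexed));
    }
}
```

If this is the cause, building the span clauses from analyzed tokens should help. The slop value and the in-order flag of the SpanNearQuery are separate things to check, because the affiliation parts appear in different orders across documents.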

© Stack Overflow or respective owner
