Lucene stop words not removed during searching
- by iamrohitbanga
I have created a Lucene index with the following analyzer.
    // Imports assume the Lucene 2.9.x package layout.
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class DocSpecAnalyzer extends Analyzer {

        private static CharArraySet stopSet;

        static {
            // second argument: ignore case when matching stop words
            stopSet = new CharArraySet(FDConstants.stopwords, true);
            // uncommenting this displays all the stop words
            // for (String s: FDConstants.stopwords) {
            //     System.out.println(s);
            // }
        }

        /**
         * Whether the StopFilter keeps position increments for removed
         * stop words. The default depends on matchVersion.
         */
        private final boolean enableStopPositionIncrements;
        private final Version matchVersion;

        public DocSpecAnalyzer(Version matchVersion) {
            this.matchVersion = matchVersion;
            enableStopPositionIncrements =
                StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion);
        }

        // Chain: tokenize -> standard filter -> lowercase -> remove stop words -> stem
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
            tokenStream.setMaxTokenLength(DEFAULT_MAX_TOKEN_LENGTH);
            TokenStream result = new StandardFilter(tokenStream);
            result = new LowerCaseFilter(result);
            result = new StopFilter(enableStopPositionIncrements, result, stopSet);
            result = new PorterStemFilter(result);
            return result;
        }

        /** Default maximum allowed token length */
        public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
    }
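To make the intended behaviour concrete, here is a plain-Java sketch of what the lowercase and stop-filter stages of the chain above are expected to do to a piece of text. It does not use Lucene at all: the class name `StopChainSketch`, the whitespace tokenization, and the three-word stop list are stand-ins for illustration (the real list is `FDConstants.stopwords`), and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java stand-in for the lowercase + stop-filter stages; not Lucene code.
class StopChainSketch {

    // Hypothetical stop list; the real one lives in FDConstants.stopwords.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of"));

    // Split on whitespace, lowercase each token, then drop stop words.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "The" and "of" are dropped; the remaining tokens are lowercased.
        System.out.println(analyze("The History of Lucene"));
    }
}
```

This is the filtering I expect to happen on the query side as well as at index time.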
Now when I search with a query that contains stop words, I get hits for the stop words as well. While writing up this question, I found the cause: http://lucene.apache.org/java/2_9_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html does not handle stop words.
Is there a substitute?
Update: I forgot to mention that I need to do a fuzzy search. That is why I am using an AnalyzingQueryParser.
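One possible workaround, sketched here in plain Java and not confirmed as the right fix: strip the stop words from the raw query text before it reaches the parser, then append `~` (the query parser's fuzzy operator) to each remaining term. The class name `FuzzyQueryPrefilter` and the stop list are hypothetical, and the sketch assumes a simple whitespace-separated query with no special query syntax.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-processing step applied to the query text before parsing.
class FuzzyQueryPrefilter {

    // Hypothetical stop list; in the real code this would be FDConstants.stopwords.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of"));

    // Drop stop words and mark every remaining term as fuzzy with '~'.
    // Assumes plain whitespace-separated terms; query syntax characters are not escaped.
    static String toFuzzyQuery(String rawQuery) {
        StringBuilder sb = new StringBuilder();
        for (String term : rawQuery.trim().split("\\s+")) {
            String lower = term.toLowerCase(Locale.ROOT);
            if (lower.isEmpty() || STOP_WORDS.contains(lower)) {
                continue; // skip stop words so they never reach the parser
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(lower).append('~');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFuzzyQuery("The history of Lucene"));
    }
}
```

The resulting string would then be handed to the parser's `parse` method in place of the original query. A query consisting only of stop words yields an empty string, which would need to be handled before parsing.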