How do I detect whether a similar document is already stored in a Lucene index?
Posted by Jenea on Stack Overflow
Published on 2010-02-09T17:02:59Z
Indexed on 2010/04/02 2:13 UTC
Hi.
I need to exclude duplicates from my database. The problem is that the duplicates are not necessarily exact matches but rather similar documents. For this purpose I decided to use FuzzyQuery as follows:
var fuzzyQuery = new global::Lucene.Net.Search.FuzzyQuery(
    new Term("text", queryText),
    0.8f,  // minimum similarity
    0);    // prefix length
hits = _searcher.Search(fuzzyQuery);
The idea was to set the minimum similarity to 0.8 (which I think is high enough) so that only similar documents are found, excluding those that are not sufficiently similar.
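To make the 0.8 threshold concrete, here is a minimal sketch of a similarity score derived from edit distance. It assumes the score is 1 minus the Levenshtein distance divided by the length of the shorter string; that normalization is an assumption about how fuzzy term matching scores candidates, and the class and method names are hypothetical helpers, not part of Lucene.Net.

```csharp
using System;

// Hypothetical helper, not part of Lucene.Net.
public static class FuzzySimilaritySketch
{
    // Classic dynamic-programming Levenshtein edit distance.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Assumed normalization: 1 - distance / length of the shorter string.
    public static float Similarity(string a, string b)
    {
        int shorter = Math.Min(a.Length, b.Length);
        if (shorter == 0) return a.Length == b.Length ? 1f : 0f;
        return 1f - (float)EditDistance(a, b) / shorter;
    }
}
```

Under this assumption, "document" and "documents" differ by one edit and score 0.875, which passes a 0.8 threshold, while "document" and "monument" differ by two edits and score 0.75, which does not. Note that the comparison is between two individual strings, so the score depends heavily on their lengths.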
To test this code I checked whether it finds a document that already exists. I assigned to the variable queryText a value that is stored in the index. The code above found nothing; in other words, it does not detect even an exact match.
The index was built with this code:
doc.Add(new global::Lucene.Net.Documents.Field(
    "text",
    text,
    global::Lucene.Net.Documents.Field.Store.YES,
    global::Lucene.Net.Documents.Field.Index.TOKENIZED,
    global::Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
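Because the field is indexed as TOKENIZED, the text goes through an analyzer and is stored as individual terms rather than as one string. The sketch below illustrates that effect under a simplifying assumption: it uses plain whitespace splitting and lowercasing, whereas a real analyzer may also stem words and drop stop words. The class and method names are hypothetical and are not Lucene.Net APIs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net.
public static class TokenizedFieldSketch
{
    // Stand-in for an analyzer: split on whitespace and lowercase each token.
    public static HashSet<string> Tokenize(string text)
    {
        var terms = new HashSet<string>();
        var separators = new[] { ' ', '\t', '\n', '\r' };
        foreach (var token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            terms.Add(token.ToLowerInvariant());
        }
        return terms;
    }

    // A single-term lookup succeeds only if the query string equals one indexed term.
    public static bool MatchesSingleTerm(HashSet<string> indexedTerms, string queryTerm)
    {
        return indexedTerms.Contains(queryTerm);
    }
}
```

With this stand-in, Tokenize("Quick brown fox") yields the terms "quick", "brown", and "fox". A lookup for "quick" succeeds, but a lookup for the whole string "Quick brown fox" fails, since no single indexed term equals the full text.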
I followed the recommendations below, and the results are: TermQuery doesn't return any results. A query constructed with
var _analyzer = new RussianAnalyzer();
var parser = new global::Lucene.Net.QueryParsers
    .QueryParser("text", _analyzer);
var query = parser.Parse(queryText);
var _searcher = new IndexSearcher(
    Settings.General.Default.LuceneIndexDirectoryPath);
var hits = _searcher.Search(query);
returns several results: the document that is an exact match has the maximum score, followed by several other documents with similar content.
© Stack Overflow or respective owner