String matching algorithms used by Lucene
Posted by iamrohitbanga on Stack Overflow
Published on 2010-02-05T16:31:28Z
Indexed on 2010/04/29 13:07 UTC
I want to know which string matching algorithms Apache Lucene uses. I have been going through the index file format used by Lucene, given here. It seems that Lucene stores every word occurring in the text as is, along with its frequency of occurrence in each document. As far as I know, though, efficient string matching would require preprocessing the words that occur in the documents.
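To make the "words plus per-document frequency" picture concrete, here is a minimal sketch of a toy inverted index in plain Java. It is not Lucene's implementation and does not use the Lucene API: the class `ToyInvertedIndex`, its methods, and the lowercase-and-split tokenization are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> (document id -> frequency).
// Hypothetical names; this only illustrates the idea, not Lucene's file format.
class ToyInvertedIndex {
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Preprocessing step (assumed): lowercase the text, split on
    // non-alphanumeric characters, and count each term per document.
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if text starts with a delimiter
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Exact term lookup: returns docId -> frequency, or an empty map if the term is unknown.
    Map<Integer, Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Map.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(1, "rohit banga asked about Lucene");
        index.addDocument(2, "Lucene stores terms; Lucene counts them");
        System.out.println(index.lookup("lucene"));        // documents containing the exact term
        System.out.println(index.lookup("iamrohitbanga")); // no exact term, so an empty map
    }
}
```

With a structure like this, looking up a whole term is a single map access, but nothing in it helps with finding a term that is only part of a longer word.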
Example: search some documents for "iamrohitbanga is a user of stackoverflow" (using fuzzy matching).
It is possible that one of the documents contains the string "rohit banga".
To find that the substrings "rohit" and "banga" are present in the search string, Lucene would need some efficient substring matching.
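The kind of matching described in this example can be sketched with a brute-force scan in plain Java. This only illustrates the problem being asked about; it is not how Lucene works, and the class `SubstringProbe`, the method `termsInsideQuery`, and the whitespace tokenization are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Brute-force check: which indexed terms occur inside some token of the query?
// Hypothetical helper for illustration only; cost grows with terms x query tokens.
class SubstringProbe {
    static List<String> termsInsideQuery(String query, List<String> indexedTerms) {
        String[] queryTokens = query.toLowerCase().split("\\s+");
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            for (String token : queryTokens) {
                if (token.contains(term)) {
                    hits.add(term);
                    break; // record each term once
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> indexedTerms = List.of("rohit", "banga", "lucene");
        String query = "iamrohitbanga is a user of stackoverflow";
        System.out.println(termsInsideQuery(query, indexedTerms)); // prints [rohit, banga]
    }
}
```

Scanning every indexed term against every query token like this does not scale to a large index, which is why the question asks what more efficient approach is used.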
I want to know which algorithm that is. Also, if Lucene does some preprocessing, which function call in the Java API triggers it?
© Stack Overflow or respective owner