String matching algorithms used by Lucene

Posted by iamrohitbanga on Stack Overflow
Published on 2010-02-05T16:31:28Z Indexed on 2010/04/29 13:07 UTC

Filed under: lucene | string-matching | algorithm | java

I want to know which string matching algorithms Apache Lucene uses. I have been going through the index file format used by Lucene, given here. It seems that Lucene stores every word occurring in the text as is, along with its frequency of occurrence in each document. But as far as I know, efficient string matching would require preprocessing the words that occur in the documents.
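A minimal sketch of the structure described above (each word stored as is, with its frequency per document), assuming a simple split-on-non-alphanumerics tokenizer. The class `TinyInvertedIndex` and its methods are hypothetical names for illustration only; this is not Lucene's file format or API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: NOT Lucene's actual index format or API.
public class TinyInvertedIndex {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Assumed tokenizer: lowercase the text, then split on anything that is not a-z or 0-9.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Returns how often the exact term occurs in the given document (0 if never).
    public int termFrequency(String term, int docId) {
        Map<Integer, Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.getOrDefault(docId, 0);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "rohit banga asked about Lucene");
        index.addDocument(2, "Lucene indexes terms; Lucene counts terms");
        System.out.println(index.termFrequency("lucene", 2));        // prints 2
        System.out.println(index.termFrequency("rohit", 1));         // prints 1
        System.out.println(index.termFrequency("iamrohitbanga", 1)); // prints 0
    }
}
```

An exact-term lookup like this finds "rohit" but not "iamrohitbanga", which is the gap the question is about.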

Example: search for "iamrohitbanga is a user of stackoverflow" in some documents, using fuzzy matching.

It is possible that there is a document containing the string "rohit banga".

To find that the substrings "rohit" and "banga" are present in the search string, it would need some efficient substring matching.
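For comparison, the naive way to detect this is to test every indexed term against the query token with `String.contains`. The sketch below shows only that baseline; `SubstringScan` and `termsInside` are hypothetical names, and nothing here is a claim about what Lucene does internally. Its cost grows with the number of indexed terms, which is why something more efficient would be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical baseline for illustration: a linear scan, not Lucene's method.
public class SubstringScan {
    // Returns the indexed terms that occur as a substring of the query token,
    // in the order they appear in indexedTerms.
    static List<String> termsInside(String queryToken, List<String> indexedTerms) {
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            // Skip empty terms: every string trivially contains "".
            if (!term.isEmpty() && queryToken.contains(term)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> indexedTerms = Arrays.asList("rohit", "banga", "lucene");
        System.out.println(termsInside("iamrohitbanga", indexedTerms)); // prints [rohit, banga]
    }
}
```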

I want to know which algorithm it is. Also, if it does some preprocessing, which function call in the Java API triggers it?

© Stack Overflow or respective owner
