How does Lucene index documents?

Posted by Mehdi Amrollahi on Stack Overflow
Published on 2010-04-08T17:51:25Z Indexed on 2010/04/08 18:03 UTC

Filed under: lucene | indexing | algorithm

Hello,

I have read some documentation about Lucene, including the document at this link (http://lucene.sourceforge.net/talks/pisa).

I don't really understand how Lucene indexes documents, or which algorithms it uses for indexing.

According to the link above, Lucene uses this algorithm for indexing:

incremental algorithm:
- maintain a stack of segment indices
- create index for each incoming document
- push new indexes onto the stack
- let b=10 be the merge factor; M=∞

for (size = 1; size < M; size *= b) {
    if (there are b indexes with size docs on top of the stack) {
        pop them off the stack;
        merge them into a single index;
        push the merged index onto the stack;
    } else {
        break;
    }
}
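
To make the quoted pseudocode easier to follow, here is a minimal Python sketch of the same stack-and-merge idea. It is only a simulation: each segment is represented by its document count, and the names add_document, index_documents, merge_factor (standing in for b) and max_size (standing in for M) are hypothetical, not Lucene API. The default of no upper bound on max_size is an assumption of this sketch.

```python
def add_document(stack, merge_factor=10, max_size=float("inf")):
    """Push a one-document segment, then merge equal-sized segments on top.

    Each stack entry is only a segment's document count (a simulation,
    not a real index). merge_factor plays the role of b, max_size of M.
    """
    stack.append(1)  # the incoming document becomes its own small segment
    size = 1
    while size < max_size:
        top = stack[-merge_factor:]
        # Merge only if the top b segments all hold exactly `size` docs.
        if len(top) == merge_factor and all(s == size for s in top):
            del stack[-merge_factor:]          # pop them off the stack
            stack.append(size * merge_factor)  # push the merged segment
        else:
            break
        size *= merge_factor
    return stack


def index_documents(n_docs, merge_factor=10, max_size=float("inf")):
    """Feed n_docs documents through add_document and return the final stack."""
    stack = []
    for _ in range(n_docs):
        add_document(stack, merge_factor, max_size)
    return stack


print(index_documents(1234))
# [1000, 100, 100, 10, 10, 10, 1, 1, 1, 1]
```

In this simulation the stack behaves like a base-b counter: 1234 documents with b=10 leave one segment of 1000, two of 100, three of 10 and four of 1. That means a search has to consult only about ten segments rather than 1234, while adding a document usually costs just one small write and only occasionally triggers a merge.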

How does this algorithm provide optimized indexing?

Does Lucene use a B-tree or a similar structure for indexing, or does it have its own particular algorithm?

Thank you for reading my post.

© Stack Overflow or respective owner
