How to handle very frequent updates to a Lucene index

Posted by fsm on Stack Overflow
Published on 2010-09-30T21:15:57Z Indexed on 2011/02/21 15:25 UTC

I am trying to prototype an indexing/search application that uses very volatile data sources (forums, social networks, etc.). Here are some of the performance requirements:

  1. Very fast turnaround time: any new data (such as a new message on a forum) should be available in the search results in less than a minute.

  2. I need to discard old documents on a fairly regular basis to ensure that the search results are not dated.

  3. Last but not least, the search application needs to be responsive: latency on the order of 100 milliseconds, and it should support at least 10 queries per second (qps).

All of the requirements I currently have (1, 2 and 3) can be met without using Lucene, but I am anticipating other requirements in the future (such as search relevance) that Lucene makes easier to implement. However, since Lucene is designed for use cases far more complex than the one I'm currently working on, I'm having a hard time satisfying my performance requirements with it.

Here are some questions:

a. I read that the optimize() method in the IndexWriter class is expensive and should not be used by applications that do frequent updates. What are the alternatives?

b. In order to do incremental updates, I need to keep committing new data and also keep refreshing the index reader so that it sees the new data. Both are going to affect requirements 1 and 3 above. Should I try duplicate indices? What are some common approaches to solving this problem?

c. I know that Lucene provides a delete method that lets you delete all documents matching a certain query. In my case, I need to delete all documents that are older than a certain age. One option is to add a date field to every document and use that to delete documents later. Is it possible to do range queries on document ids to delete documents instead? (I can create my own id field, since I think the one created by Lucene keeps changing.) Would that be any faster than comparing dates represented as strings?
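
To make the "dates represented as strings" part of (c) concrete: if the date field is a fixed-width string with the most significant field first, plain string comparison already orders documents chronologically, so an age cutoff can be written as a range on that field. The following is only a minimal sketch under that assumption. The `yyyyMMddHHmmss` UTC format and the names `AgeCutoff`, `toKey` and `isExpired` are hypothetical and not part of Lucene, and the sketch says nothing about which representation is faster inside Lucene.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not a Lucene class.
final class AgeCutoff {
    // Fixed-width, most-significant-field-first key, so lexicographic
    // order matches chronological order (for four-digit years).
    static String toKey(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    // A document counts as expired if its key sorts strictly before
    // the key of the cutoff instant (now minus the maximum age).
    static boolean isExpired(String docKey, long nowMillis, long maxAgeMillis) {
        String cutoffKey = toKey(nowMillis - maxAgeMillis);
        return docKey.compareTo(cutoffKey) < 0;
    }
}
```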

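The "duplicate indices" idea in (b) can be pictured as keeping one read-only view for searches while new data accumulates on the side, then swapping the view in one step. The following is a rough sketch of that pattern only, assuming an in-memory list stands in for the index. `Snapshot` and `SnapshotHolder` are hypothetical names rather than Lucene classes; in a real application the snapshot would wrap whatever reader or searcher object the index library provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an immutable, point-in-time view of an index.
final class Snapshot {
    private final List<String> docs;

    Snapshot(List<String> docs) {
        // Copy so later additions do not leak into this view.
        this.docs = Collections.unmodifiableList(new ArrayList<String>(docs));
    }

    // Naive substring match, standing in for a real query.
    List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        for (String d : docs) {
            if (d.contains(term)) {
                hits.add(d);
            }
        }
        return hits;
    }
}

// Hypothetical holder: searches read the current snapshot, while the
// indexing side builds the next one and swaps it in atomically.
final class SnapshotHolder {
    private final AtomicReference<Snapshot> current =
            new AtomicReference<Snapshot>(new Snapshot(new ArrayList<String>()));
    private final List<String> pending = new ArrayList<String>();

    // Called by the indexing side; not visible to searches until refresh().
    synchronized void add(String doc) {
        pending.add(doc);
    }

    // Called periodically, off the query path, to publish new data.
    synchronized void refresh() {
        current.set(new Snapshot(pending));
    }

    // Called by query threads; takes no lock shared with add() or refresh().
    List<String> search(String term) {
        return current.get().search(term);
    }
}
```

In this sketch the refresh interval bounds how stale the results can be (requirement 1), and queries never wait on indexing (requirement 3). The cost is that every refresh copies the data, which a real index would need to avoid.
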
I know these are very open questions, so I am not looking for a detailed answer. I will try to treat all of your answers as suggestions and use them to inform my design. Thanks! Please let me know if you need any other information.

© Stack Overflow or respective owner
