How to index a string like "aaa.bbb.ddd-fff" in Lucene?
Posted by user46703 on Stack Overflow
Published on 2010-05-27T21:59:31Z
Hi,
I have to index a lot of documents that contain reference numbers like "aaa.bbb.ddd-fff". The structure can change, but it's always some arbitrary numbers or characters joined by "/", "-", "_" or some other delimiter.
The users want to be able to search for any of the substrings, like "aaa" or "ddd", and also for combinations like "aaa.bbb" or "ddd-fff". The best I have been able to come up with is my own token filter, modeled after the synonym filter in "Lucene in Action", which emits multiple terms for each input. In my case it returns "aaa.bbb", "bbb.ddd", "bbb.ddd-fff" and all other combinations of the substrings. This works pretty well, but when I index large documents (100 MB) that contain lots of such strings I tend to get out-of-memory exceptions, because my filter returns multiple terms for each input string.
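To make the expansion concrete, here is a rough sketch in plain Java of the kind of term generation described above. It is not the actual filter and uses no Lucene classes. The class name `ReferenceNumberExpander`, the method `expand`, and the rule that every non-alphanumeric character counts as a delimiter are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: shows which terms get emitted
// for one reference number when every contiguous combination is indexed.
public class ReferenceNumberExpander {

    // Assumption: a "part" is a run of letters or digits; anything else is a delimiter.
    private static final Pattern PART = Pattern.compile("[A-Za-z0-9]+");

    public static List<String> expand(String reference) {
        // Record the start and end offsets of each part in the original string.
        List<int[]> spans = new ArrayList<>();
        Matcher m = PART.matcher(reference);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }

        // Emit every contiguous run of parts, keeping the original delimiters.
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < spans.size(); i++) {
            for (int j = i; j < spans.size(); j++) {
                terms.add(reference.substring(spans.get(i)[0], spans.get(j)[1]));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String term : expand("aaa.bbb.ddd-fff")) {
            System.out.println(term);
        }
    }
}
```

For a reference number with n parts this produces n(n+1)/2 terms, so "aaa.bbb.ddd-fff" alone yields 10 terms. That growth is presumably what makes memory a problem on large documents.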
Is there a better way to index these strings?