Lucene Analyzer to Use With Special Characters and Punctuation?

Posted by Brandon on Stack Overflow
Published on 2010-04-29T02:19:15Z; indexed on 2010/04/29 2:27 UTC

I have a Lucene index that has several documents in it. Each document has multiple fields such as:

Id
Project
Name
Description

The Id field is a unique identifier such as a GUID. Project is the user's ProjectID, and a user can only view documents for their own project. Name and Description contain text that may include special characters.

When a user searches on the Name field, I want to match as loosely as I reasonably can. For example, a search for:

First

should return both:

First.Last 

and

First.Middle.Last

Name can also be something like:

Test (NameTest)

In that case, a user who types 'Test', 'Name', or '(NameTest)' should find the document.
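To make the Name matching I'm after concrete, here is a rough sketch in plain Python rather than Lucene code. The `name_terms` and `matches` helpers are hypothetical names made up for illustration, and the camel-case splitting is only my assumption about how 'Name' would have to be found inside 'NameTest'. It describes the behaviour I want, not how any particular Analyzer works.

```python
import re

def name_terms(value):
    """Hypothetical helper: every lowercased term a Name value should be findable by."""
    terms = set()
    for chunk in value.split():
        # Keep the whole whitespace-delimited chunk, e.g. "(NameTest)" -> "(nametest)".
        terms.add(chunk.lower())
        # Split on punctuation, e.g. "First.Last" -> "First", "Last".
        for word in re.findall(r"[A-Za-z0-9]+", chunk):
            terms.add(word.lower())
            # Assumption: also split camel case, e.g. "NameTest" -> "Name", "Test".
            for part in re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", word):
                terms.add(part.lower())
    return terms

def matches(query, value):
    """True if the (case-insensitive) query equals one of the value's terms."""
    return query.lower() in name_terms(value)

for name in ["First.Last", "First.Middle.Last", "Test (NameTest)"]:
    print(name, sorted(name_terms(name)))
```

With this, `matches("First", "First.Middle.Last")` and `matches("(NameTest)", "Test (NameTest)")` both hold, while `matches("Last", "First")` does not.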

However, a search on Project, for example 'ProjectA', must be an exact match, ignoring case. The same goes for the Id field.
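For contrast, the exact-match behaviour I want for Project and Id could be sketched like this, again in plain Python and not as Lucene code. `exact_key` and `exact_match` are hypothetical names, and treating the whole field value as a single case-folded term is my assumption about what "exact match, ignoring case" should mean here.

```python
def exact_key(value):
    """Hypothetical helper: the whole field value as one case-folded term."""
    return value.casefold()

def exact_match(query, stored):
    """True only if the query equals the stored value, ignoring case."""
    return exact_key(query) == exact_key(stored)

print(exact_match("projecta", "ProjectA"))  # same value, different case
print(exact_match("Project", "ProjectA"))   # a partial value must not match
```

Unlike the Name field, nothing is split here, so a partial value such as 'Project' does not find 'ProjectA'.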

Which fields should I set up as Tokenized and which as Untokenized? Also, is there a good Analyzer I should consider to make this happen?

I am stuck trying to decide the best route to implement the desired searching.
