Lucene Analyzer to Use With Special Characters and Punctuation?

Posted by Brandon on Stack Overflow See other posts from Stack Overflow or by Brandon
Published on 2010-04-29T02:19:15Z Indexed on 2010/04/29 2:27 UTC
Read the original article Hit count: 272

Filed under:

c#

|

lucene

|

lucene.net

|

indexing

I have a Lucene index that has several documents in it. Each document has multiple fields such as:

Id
Project
Name
Description

The Id field will be a unique identifier such as a GUID, Project is a user's ProjectID and a user can only view documents for their project, and Name and Description contain text that can have special characters.

When a user performs a search on the Name field, I want to be able to attempt to match the best I can such as:

First

Will return both:

First.Last

and

First.Middle.Last

Name can also be something like:

Test (NameTest)

Where, if a user types in 'Test', 'Name', or '(NameTest)', then they can find the result.

However, if I say that Project is 'ProjectA' then that needs to be an exact match (case insensitive search). The same goes with the Id field.

Which fields should I set up as Tokenized and which as Untokenized? Also, is there a good Analyzer I should consider to make this happen?

I am stuck trying to decide the best route to implement the desired searching.

© Stack Overflow or respective owner

Related posts about c#

.NET WebRequest.PreAuthenticate not quite what it sounds like

as seen on West-Wind - Search for 'West-Wind'
I’ve run into the problem a few times now: How to pre-authenticate .NET WebRequest calls doing an HTTP call to the server – essentially send authentication credentials on the very first request instead of waiting for a server challenge first? At first glance this sound like it should be easy:… >>> More
HttpWebRequest and Ignoring SSL Certificate Errors

as seen on West-Wind - Search for 'West-Wind'
Man I can't believe this. I'm still mucking around with OFX servers and it drives me absolutely crazy how some these servers are just so unbelievably misconfigured. I've recently hit three different 3 major brokerages which fail HTTP validation with bad or corrupt certificates at least according to… >>> More
The dynamic Type in C# Simplifies COM Member Access from Visual FoxPro

as seen on West-Wind - Search for 'West-Wind'
I’ve written quite a bit about Visual FoxPro interoperating with .NET in the past both for ASP.NET interacting with Visual FoxPro COM objects as well as Visual FoxPro calling into .NET code via COM Interop. COM Interop with Visual FoxPro has a number of problems but one of them at least got a lot… >>> More
Dynamic Type to do away with Reflection

as seen on West-Wind - Search for 'West-Wind'
The dynamic type in C# 4.0 is a welcome addition to the language. One thing I’ve been doing a lot with it is to remove explicit Reflection code that’s often necessary when you ‘dynamically’ need to walk and object hierarchy. In the past I’ve had a number of ReflectionUtils that used string based expressions… >>> More
Finding a Relative Path in .NET

as seen on West-Wind - Search for 'West-Wind'
Here’s a nice and simple path utility that I’ve needed in a number of applications: I need to find a relative path based on a base path. So if I’m working in a folder called c:\temp\templates\ and I want to find a relative path for c:\temp\templates\subdir\test.txt I want to receive back subdir\test… >>> More

Related posts about lucene

performance comparision between Zend Lucene and Java Lucene

as seen on Stack Overflow - Search for 'Stack Overflow'
Zend Lucene and Java Lucene are built in PHP and java repectively, and PHP language has a higher level than java. Just wondering How big the performance difference among these two, regarding to index building and data searching? Is it much more effective to let java create and rebuild index, and… >>> More
Why wasn't fast-vector-highlighter (lucene-contrib) made an official part of Lucene 3.0 core

as seen on Stack Overflow - Search for 'Stack Overflow'
I've read some Jira entries and they mentioned moving fast-vector-highlighter to core about a year ago but it never made it. Looking at the svn for contrib it seems incomplete. There are no tests for FastVectorHighlighter Documentation is lacking No samples anywhere on apache.org Anyone have… >>> More
pylucene: install error

as seen on Stack Overflow - Search for 'Stack Overflow'
I am trying to install Pylucene (pylucene-3.3-3-src.tar.gz) on my ubuntu linux 11.10. I have python 2.7.2. I was able to compile JCC (I think) because I didnt see any error when I installed it. When I tried to install Pylucene I get the following error. Can someone help? Thanks. ICU not installed /usr/bin/python… >>> More
Solr WordDelimiterFilter + Lucene Highlighter

as seen on Stack Overflow - Search for 'Stack Overflow'
I am trying to get the Highlighter class from Lucene to work properly with tokens coming from Solr's WordDelimiterFilter. It works 90% of the time, but if the matching text contains a ',' such as "1,500" the output is incorrect: Expected: 'test 1,500 this' Observed: 'test 11,500 this' I… >>> More
java AbstractMethodError

as seen on Stack Overflow - Search for 'Stack Overflow'
How to handle this error in lucene: java.lang.AbstractMethodError: org.apache.lucene.store.Directory.listAll()[Ljava/lang/String; at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:568) at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69) … >>> More