Tokenizer for full-text

Posted by user72185 on Stack Overflow
Published on 2010-04-08T14:24:50Z Indexed on 2010/04/08 14:53 UTC

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.

Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text searching. Performance is very important, since I will be parsing many gigabytes of text.

Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (AFAIK) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't find even one of them :)
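
To make "heuristics" concrete, here is a minimal sketch of the kind of tokenizer meant, not an existing library. The function name `tokenize`, the `min_len` parameter, and the rules themselves are all assumptions for illustration: any byte that is not an ASCII letter or digit is treated as a separator, tokens are lowercased, and very short tokens are dropped. It assumes C++17 (for `std::string_view`) and the default "C" locale, so multi-byte UTF-8 characters are simply treated as separators.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not taken from any library.
// Splits text into lowercase alphanumeric tokens for a search index.
// Assumed heuristics:
//   - every non-alphanumeric byte is a separator
//   - tokens shorter than min_len are discarded
std::vector<std::string> tokenize(std::string_view text, std::size_t min_len = 2) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {  // unsigned char keeps <cctype> calls well-defined
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else {
            // A separator ends the current token, if there is one long enough.
            if (current.size() >= min_len) tokens.push_back(current);
            current.clear();
        }
    }
    // Flush the last token when the text does not end with a separator.
    if (current.size() >= min_len) tokens.push_back(current);
    return tokens;
}
```

With these rules, `tokenize("Hello, World!")` yields the tokens `hello` and `world`. A real indexing tokenizer would likely need more than this (Unicode handling, stop words, stemming), which is exactly why an existing, well-tested one is preferable.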

Any good pointers?

Thanks!

© Stack Overflow or respective owner
