Tokenizer for full-text
Posted by user72185 on Stack Overflow
Published on 2010-04-08T14:24:50Z
Indexed on 2010/04/08 14:53 UTC
This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.
Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text searching. Performance is very important, since I will be parsing many gigabytes of text.
Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (AFAIK) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't even find one of them :)
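To make "some heuristics" concrete, here is a minimal sketch of the kind of baseline tokenizer in question. It is only an illustration under stated assumptions, not a recommendation of any existing library: the function name `tokenize` is hypothetical, the input is assumed to be ASCII in the default "C" locale, and the only heuristics applied are "split on anything that is not a letter or digit" and "lowercase everything". A real search index would probably also need Unicode handling, stemming, and stop-word filtering.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from any library): splits text into lowercase
// alphanumeric tokens. Every byte that is not an ASCII letter or digit is
// treated as a separator, so in the default "C" locale multi-byte UTF-8
// characters also end up splitting tokens.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            // Normalize case so that "Index" and "index" map to one term.
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            // A separator ends the token collected so far.
            tokens.push_back(std::move(current));
            current.clear();
        }
    }
    // Flush the last token if the text did not end with a separator.
    if (!current.empty()) {
        tokens.push_back(std::move(current));
    }
    return tokens;
}
```

This makes a single pass over the input, which matters at the gigabyte scale, but it still allocates one string per token. An implementation tuned for throughput would more likely emit offsets into the original buffer instead of copying.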
Any good pointers?
Thanks!
© Stack Overflow or respective owner