Tokenizer for full-text
- by user72185
This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.
Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full text searching. Performance is very important, I will parse many gigabytes of text.
Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (afaik) and requires some heuristics. This has been done a thousand time before, and probably in a thousand different ways, but I can't even find one of them :)
Any good pointers?
Thanks!