Good library for search text tokenization
- by Chris Dutrow
I'm looking to tokenize some text in the same or similar way that a search engine would.
The reason we are doing this is so that we can run some statistical analysis on the tokens. We are using Python, so we would prefer a library in that language, but we could probably set something up to use another language if necessary.
Example
Original text:
We have some great burritos!
More simplified (remove plurals and punctuation):
We have some great burrito
Even more simplified (remove superfluous words):
great burrito
Best (recognize positive and negative meaning):
burrito -positive-
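
To make the intent concrete, here is a minimal sketch of the kind of pipeline I'm picturing, using NLTK (tokenization, WordNet lemmatization to drop plurals, stopword removal, and VADER for the positive/negative step). The choice of NLTK and the one-time resource downloads are just assumptions on my part, not a settled approach:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # One-time resource downloads (uncomment on first run):
    # nltk.download('punkt'); nltk.download('wordnet')
    # nltk.download('stopwords'); nltk.download('vader_lexicon')

    text = "We have some great burritos!"

    # 1. Tokenize and drop punctuation-only tokens
    tokens = [t for t in nltk.word_tokenize(text) if t.isalpha()]

    # 2. Reduce plurals to singular form (lemmatization)
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]

    # 3. Remove superfluous (stop) words
    stop_words = set(stopwords.words('english'))
    content = [t for t in lemmas if t.lower() not in stop_words]
    print(content)  # ['great', 'burrito']

    # 4. Rough positive/negative signal on the original sentence
    score = SentimentIntensityAnalyzer().polarity_scores(text)['compound']
    print('positive' if score > 0 else 'negative')  # positive

I'm open to a different library or approach if there is something closer to what real search engines do.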