Good library for search text tokenization
Posted by Chris Dutrow on Programmers
Published on 2012-11-15T18:41:31Z
I am looking to tokenize some text in the same way, or nearly the same way, that a search engine would.
We want to do this so that we can run some statistical analysis on the tokens. We are using Python, so we would prefer a library in that language, but we could probably set something up in another language if necessary.
Example
Original text:
We have some great burritos!
Simplified (plurals and punctuation removed):
We have some great burrito
Simplified further (superfluous words removed):
great burrito
Best (positive or negative meaning recognized):
burrito -positive-
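To make the stages in the example concrete, here is a minimal standard-library sketch of that pipeline. It is not a library recommendation. The word lists (STOP_WORDS, POSITIVE, NEGATIVE) and the function names are hypothetical placeholders, and the plural rule is deliberately naive. A stemmer or lemmatizer from an NLP library would normally replace it. Unlike the example above, this sketch also lowercases the text, as search engines commonly do.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration only).
STOP_WORDS = {"we", "have", "some", "a", "an", "the", "is", "are"}
POSITIVE = {"great", "good", "tasty"}
NEGATIVE = {"bad", "awful", "terrible"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def singularize(token):
    """Naive plural stripping: drop one trailing 's'. Not a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def simplify(text):
    """Tokenize, strip plurals, then remove stop words."""
    tokens = [singularize(t) for t in tokenize(text)]
    return [t for t in tokens if t not in STOP_WORDS]


def tag_sentiment(tokens):
    """Replace sentiment words with a single polarity tag at the end."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    content = [t for t in tokens if t not in POSITIVE and t not in NEGATIVE]
    if score > 0:
        content.append("-positive-")
    elif score < 0:
        content.append("-negative-")
    return content


text = "We have some great burritos!"
print(tokenize(text))                 # ['we', 'have', 'some', 'great', 'burritos']
print(simplify(text))                 # ['great', 'burrito']
print(tag_sentiment(simplify(text)))  # ['burrito', '-positive-']
```

The naive plural rule will mangle words such as "this", so it only serves to show where a proper stemming step would sit in the pipeline.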