Good library for search text tokenization

Posted by Chris Dutrow on Programmers
Published on 2012-11-15T18:41:31Z Indexed on 2012/11/15 23:26 UTC

I'm looking to tokenize some text in the same way, or a similar way, that a search engine would.

The reason is that we want to run some statistical analysis on the tokens. We are using Python, so we would prefer a library in that language, but we could probably set something up to use another language if necessary.

Example

Original text:

We have some great burritos!

More simplified: (remove plurals and punctuation)

We have some great burrito

Even more simplified: (remove superfluous words)

great burrito
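The first two simplification steps can be sketched with the standard library alone. This is a rough illustration, not a recommendation of a specific approach: the `STOP_WORDS` set, the function names, and the trailing-"s" plural rule are all assumptions made up for this example, and the tokens are lowercased, which search engines commonly do but the example above does not show. A library such as NLTK offers fuller tokenizers, stemmers, and stop-word lists.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {"we", "have", "some", "a", "an", "the", "is", "are", "of", "to"}


def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def strip_plural(token):
    """Naive singularization (an assumption, not a real stemmer):
    drop a trailing 's' from tokens longer than three characters,
    unless the token ends in 'ss'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def simplify(text):
    """Tokenize, singularize, then remove stop words."""
    tokens = [strip_plural(t) for t in tokenize(text)]
    return [t for t in tokens if t not in STOP_WORDS]


print(simplify("We have some great burritos!"))  # ['great', 'burrito']
```

The plural rule will mangle words such as "bus" or "this" once they exceed the length check, so a proper stemmer or lemmatizer would be the natural replacement for `strip_plural`.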

Best: (recognize positive and negative meaning)

burrito -positive-
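A minimal sketch of that last step, assuming a hand-made sentiment lexicon: the `POSITIVE` and `NEGATIVE` sets and the `tag_sentiment` function are hypothetical and exist only to illustrate the idea. Sentiment words are removed from the token list and replaced by a single marker based on which polarity dominates.

```python
# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of entries.
POSITIVE = {"great", "good", "tasty", "excellent"}
NEGATIVE = {"bad", "awful", "bland", "terrible"}


def tag_sentiment(tokens):
    """Drop sentiment words and append one -positive-/-negative- marker.

    No marker is added when positive and negative words cancel out
    or when none are present.
    """
    kept = []
    score = 0
    for token in tokens:
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
        else:
            kept.append(token)
    if score > 0:
        kept.append("-positive-")
    elif score < 0:
        kept.append("-negative-")
    return kept


print(tag_sentiment(["great", "burrito"]))  # ['burrito', '-positive-']
```

Word counting like this ignores negation ("not great") and context, so it is only a starting point for the statistical analysis described in the question.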
