How to extract common / significant phrases from a series of text entries

Posted by arronsky on Stack Overflow
Published on 2010-03-16T08:42:38Z

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (several of them, not just the single most common phrase, and ideally without enforcing word-for-word matching).

My example is any review page on Yelp.com, which shows 3 snippets drawn from hundreds of reviews of a given restaurant, in the format:

"Try the hamburger" (in 44 reviews)

e.g., the "Review Highlights" section of this page:

http://www.yelp.com/biz/sushi-gen-los-angeles/

I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here. Thanks in advance for any help.
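
A minimal sketch of one possible starting point, using only the Python standard library rather than NLTK. It strips the HTML, collects the 2- and 3-word phrases in each entry, and counts how many entries contain each phrase, which mirrors the "(in 44 reviews)" style of count. The names `strip_html`, `phrases_in`, `common_phrases`, the tiny stopword list, and the sample `reviews` are all assumptions made for illustration. It matches phrases word for word; looser matching (for example stemming each word first) would be an extension and is not shown.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the visible text of a raw HTML string."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


# Tiny illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "was", "it", "i", "in"}


def phrases_in(text, n_min=2, n_max=3):
    """Return the set of distinct n-word phrases found in one entry."""
    words = re.findall(r"[a-z']+", text.lower())
    found = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip phrases that begin or end with a stopword ("try the").
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            found.add(" ".join(gram))
    return found


def common_phrases(entries, top=3, min_entries=2):
    """Rank phrases by the number of entries they appear in."""
    entry_counts = Counter()
    for raw in entries:
        # Each entry contributes a phrase at most once, because
        # phrases_in() returns a set.
        entry_counts.update(phrases_in(strip_html(raw)))
    ranked = [(p, c) for p, c in entry_counts.most_common() if c >= min_entries]
    return ranked[:top]


# Hypothetical sample data standing in for rows from the database.
reviews = [
    "<p>Try the <b>hamburger</b>, it was great.</p>",
    "<div>You should try the hamburger here.</div>",
    "<p>I would try the hamburger again.</p>",
]

for phrase, count in common_phrases(reviews):
    print(f'"{phrase}" (in {count} reviews)')
```

Counting entries rather than raw occurrences keeps one long, repetitive review from dominating the ranking. NLTK's collocation finders (for example `nltk.collocations.BigramCollocationFinder`) are another route, scoring word pairs by statistical association instead of plain frequency.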

© Stack Overflow or respective owner
