How to extract common / significant phrases from a series of text entries
- by arronsky
I have a series of text items: raw HTML stored in a MySQL database. I want to find the most common phrases across these entries (the top several phrases, not just the single most common one, and ideally without requiring word-for-word matches).
My example is any business page on Yelp.com, which shows 3 snippets drawn from hundreds of reviews of a given restaurant, in the format:
"Try the hamburger" (in 44 reviews)
e.g., the "Review Highlights" section of this page:
http://www.yelp.com/biz/sushi-gen-los-angeles/
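To make the goal concrete, here is a minimal sketch of the naive, exact-match baseline, using only the Python standard library rather than NLTK. The function names (`strip_html`, `ngrams`, `common_phrases`), the regex-based tag stripping, and the sample reviews are all made up for illustration and are not from my actual data. It counts how many entries contain each three-word phrase, which is roughly the "(in 44 reviews)" number, but it does not handle near-matches or decide which phrases are "significant".

```python
import re
from collections import Counter

def strip_html(raw):
    """Crude tag removal (assumption: a real HTML parser would be more robust)."""
    return re.sub(r"<[^>]+>", " ", raw)

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def common_phrases(entries, n=3, top=3):
    """Return the `top` n-word phrases ranked by how many entries contain them."""
    doc_freq = Counter()
    for raw in entries:
        tokens = re.findall(r"[a-z']+", strip_html(raw).lower())
        # set() so a phrase repeated inside one entry is counted once
        doc_freq.update(set(ngrams(tokens, n)))
    return [(" ".join(phrase), count) for phrase, count in doc_freq.most_common(top)]

# Hypothetical sample entries, invented for this example
reviews = [
    "<p>You should try the hamburger here.</p>",
    "<div>Try the hamburger, it is great!</div>",
    "<p>I did not try the hamburger.</p>",
]

for phrase, count in common_phrases(reviews, n=3, top=1):
    print(f'"{phrase}" (in {count} reviews)')
```

This only matches identical word sequences, so "try the burger" and "try the hamburger" would be counted separately, which is exactly the limitation I would like to get past.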
I have NLTK installed and I've played around with it a bit, but I am honestly overwhelmed by the options. This seems like a rather common problem, yet I haven't been able to find a straightforward solution by searching here. Thanks in advance for any help.