How to extract common / significant phrases from a series of text entries
Posted
by arronsky
on Stack Overflow
Published on 2010-03-16T08:42:38Z
I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (the several most common phrases rather than just the single top one, and ideally without requiring word-for-word matches).
My example is any business page on Yelp.com, which shows 3 snippets drawn from hundreds of reviews of a given restaurant, in the format:
"Try the hamburger" (in 44 reviews)
e.g., the "Review Highlights" section of this page:
http://www.yelp.com/biz/sushi-gen-los-angeles/
I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here. Thanks in advance for any help.
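One possible starting point, sketched with only the Python standard library so it runs without any NLTK setup: strip the HTML, split each entry into lowercase words, collect the 2- and 3-word phrases in each entry, and count how many entries contain each phrase (which gives the "in 44 reviews" style of number). The function names, the tiny stopword list, and the thresholds below are all assumptions made for illustration, not a reconstruction of how Yelp builds its highlights.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "was", "i", "for", "with", "this", "that"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the visible text of a raw HTML string."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def phrases_in(tokens, min_n=2, max_n=3):
    """Return the set of distinct n-grams found in one entry."""
    found = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Skip phrases that start or end with a stopword ("the hamburger").
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            found.add(" ".join(gram))
    return found


def common_phrases(entries, top=3, min_entries=2):
    """Rank phrases by the number of entries that contain them."""
    entry_counts = Counter()
    for raw in entries:
        # Using a set per entry means each entry counts a phrase at most once.
        entry_counts.update(phrases_in(tokenize(strip_html(raw))))
    ranked = [(p, c) for p, c in entry_counts.most_common() if c >= min_entries]
    return ranked[:top]


if __name__ == "__main__":
    reviews = [
        "<p>Try the hamburger, it is great!</p>",
        "<div>You must try the hamburger here.</div>",
        "<b>Try the hamburger</b> and the fries.",
    ]
    for phrase, count in common_phrases(reviews):
        print(f'"{phrase}" (in {count} reviews)')
```

This only matches phrases word for word. To loosen that, one option is to normalize each token before counting, for example with a stemmer such as NLTK's `PorterStemmer`, so that "burger" and "burgers" fall into the same phrase. NLTK's `BigramCollocationFinder` in `nltk.collocations` is another route if you want to rank phrases by statistical association rather than raw counts.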
© Stack Overflow or respective owner