How to extract common / significant phrases from a series of text entries

Posted by arronsky on Stack Overflow
Published on 2010-03-16T08:42:38Z

Filed under: nlp | nltk | text-extraction | text-analysis

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not just the single most common phrase, and ideally without enforcing word-for-word matching).

As an example, any restaurant page on Yelp.com shows 3 snippets drawn from hundreds of reviews of that restaurant, in the format:

"Try the hamburger" (in 44 reviews)

e.g., the "Review Highlights" section of this page:

http://www.yelp.com/biz/sushi-gen-los-angeles/

I have NLTK installed and I've played around with it a bit, but am honestly overwhelmed by the options. This seems like a rather common problem and I haven't been able to find a straightforward solution by searching here. Thanks in advance for any help.
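
To make the goal concrete, here is a minimal sketch of the kind of output I am after, using only the Python standard library rather than NLTK. It strips the HTML, collects 2- and 3-word phrases from each entry, and counts how many entries contain each phrase (the "in 44 reviews" number). The function names, the tiny stop-word list, and the thresholds are all hypothetical choices for illustration, and it only does exact word-for-word matching, so it does not cover the fuzzy-matching part of the question.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the visible text of an HTML fragment as one string."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "was", "it", "in"}


def phrases(text, n_min=2, n_max=3):
    """Return the set of distinct n-word phrases found in one entry."""
    words = re.findall(r"[a-z']+", text.lower())
    found = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip phrases that start or end with a stop word ("the hamburger").
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            found.add(" ".join(gram))
    return found


def common_phrases(entries, top=3, min_entries=2):
    """Rank phrases by the number of entries that contain them."""
    counts = Counter()
    for raw in entries:
        # phrases() returns a set, so each entry counts a phrase at most once.
        counts.update(phrases(strip_html(raw)))
    ranked = [(p, c) for p, c in counts.most_common() if c >= min_entries]
    return ranked[:top]


if __name__ == "__main__":
    sample = [
        "<p>Try the <b>hamburger</b>, it was great.</p>",
        "<div>You should try the hamburger!</div>",
        "<p>The fries were cold.</p>",
    ]
    for phrase, n in common_phrases(sample):
        print('"%s" (in %d reviews)' % (phrase, n))
```

Counting entries rather than raw occurrences keeps one long review that repeats a phrase from dominating the ranking.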

