Ngram IDF smoothing

Posted by adi92 on Stack Overflow
Published on 2010-06-10

I am trying to use IDF scores to find interesting phrases in a fairly large corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases (SIPs): phrases that distinguish a document from all the others.
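To make the setup concrete, here is a minimal sketch of computing IDF for all n-grams of a tokenized corpus (the helper names `ngrams` and `idf_scores` are illustrative, not from any particular library):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def idf_scores(docs, n):
    """IDF of every n-gram across a corpus of tokenized documents.
    Uses the plain log(N / df) formulation; smoothed variants differ."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc, n)))  # count each gram once per document
    return {g: math.log(n_docs / df_g) for g, df_g in df.items()}
```

With this in hand, the problem below is that `idf_scores(docs, 3)` can assign a huge score to a trigram whose unigrams all score near zero.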
The problem I am running into is that some 3- and 4-grams in my data have very high IDF even though their component unigrams and bigrams have very low IDF.
For example, "you've never tried" has a very high IDF, while each of its component unigrams has a very low IDF.
I need to come up with a function that takes the document frequencies of an n-gram and of all its component (n-k)-grams and returns a more meaningful measure of how well the phrase distinguishes its parent document from the rest of the corpus.
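One hypothetical heuristic for such a function (not a standard method, just an illustration of the shape being asked for): since any document containing the full n-gram must also contain every component, idf(n-gram) >= max component idf, and the *residual* over that bound measures how much rarer the phrase is than its parts already force it to be.

```python
import math

def residual_idf(df_ngram, df_components, n_docs):
    """Hypothetical heuristic: surplus rarity of the full n-gram over
    the rarest of its component (n-k)-grams. A near-zero residual means
    the n-gram's high IDF is merely inherited from one rare component;
    a large residual means the *combination* itself is improbable."""
    idf = lambda df: math.log(n_docs / df)
    return idf(df_ngram) - max(idf(df) for df in df_components)
```

For instance, an n-gram seen in 1 of 100 documents whose rarest component appears in 40 of them gets residual log(100) - log(2.5) = log(40), whereas an n-gram exactly as rare as its rarest component gets 0.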
If I were dealing with probabilities, I would try interpolation or backoff models. But I am not sure what assumptions and intuitions those models leverage to perform well, and so how well they would carry over to IDF scores.
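For what it's worth, the interpolation intuition could be ported to IDF space directly, in the style of Jelinek-Mercer smoothing: blend the raw n-gram IDF with a backoff estimate built from the components. This is only a sketch under that assumption; the mixing weight `lam` and the choice of the mean as the backoff are both hypothetical.

```python
def interpolated_idf(idf_ngram, component_idfs, lam=0.7):
    """Sketch of a Jelinek-Mercer-style interpolation applied to IDF:
    blend the raw n-gram IDF with a backoff estimate derived from its
    component (n-k)-grams (here, their mean). lam is a tuning weight;
    lam=1 trusts the raw n-gram IDF entirely."""
    backoff = sum(component_idfs) / len(component_idfs)
    return lam * idf_ngram + (1 - lam) * backoff
```

Whether this downweights phrases like "you've never tried" enough depends on how `lam` is tuned, which is exactly the open question.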
Does anybody have any better ideas?


Tags: machine-learning, nlp