N-gram IDF smoothing
- by adi92
I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others.
The problem I am running into is that some 3- and 4-grams in my data that have super-high IDF actually consist of component unigrams and bigrams with really low IDF.
For example, "you've never tried" has a very high IDF, while each of its component unigrams has a very low IDF.
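To make the numbers concrete, here is roughly what I mean. The document counts below are made up purely for illustration:

```python
import math

def idf(document_frequency, num_docs):
    # Standard IDF: the rarer a term is across documents, the higher its score
    return math.log(num_docs / document_frequency)

# Hypothetical counts in a corpus of a million documents:
N = 1_000_000
print(idf(3, N))        # "you've never tried" in ~3 docs  -> very high IDF (~12.7)
print(idf(400_000, N))  # "never" in ~400k docs            -> very low IDF (~0.9)
```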
I need to come up with a function that takes the document frequencies of an n-gram and all its component (n-k)-grams, and returns a more meaningful measure of how much the phrase distinguishes its parent document from the rest.
If I were dealing with probabilities, I would try interpolation or backoff models, something along the lines of the sketch below. But I am not sure what assumptions/intuitions those models rely on to perform well, so I don't know how well they would carry over to IDF scores.
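A naive interpolation-style blend of IDF scores might look like this. `lam` is just a placeholder mixing weight; I have no idea what a principled value would be:

```python
import math

def interpolated_idf(ngram_df, component_dfs, num_docs, lam=0.5):
    """Blend the n-gram's own IDF with the average IDF of its component
    (n-k)-grams, so that a rare phrase built entirely out of very common
    words gets discounted.

    ngram_df:       document frequency of the full n-gram
    component_dfs:  document frequencies of its component (n-k)-grams
    lam:            arbitrary mixing weight, would need tuning
    """
    own_idf = math.log(num_docs / ngram_df)
    component_idf = sum(math.log(num_docs / df) for df in component_dfs) / len(component_dfs)
    return lam * own_idf + (1 - lam) * component_idf
```

This at least pulls "you've never tried" down toward the low IDF of its unigrams, but the choice of `lam` and the simple averaging feel completely ad hoc to me.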
Does anybody have any better ideas?