N-gram IDF smoothing
- by adi92
I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others.
The problem I am running into is that some 3- and 4-grams in my data that have super-high IDF actually consist of component unigrams and bigrams with really low IDF.
For example, "you've never tried" has a very high IDF, while each of its component unigrams has a very low IDF.
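To make the numbers concrete, here is roughly what I mean. The document counts below are made up purely for illustration:

```python
import math

def idf(document_frequency, num_docs):
    # Standard IDF: the rarer a term is across documents, the higher its score
    return math.log(num_docs / document_frequency)

# Hypothetical counts in a corpus of a million documents:
N = 1_000_000
print(idf(3, N))        # "you've never tried" in ~3 docs  -> very high IDF (~12.7)
print(idf(400_000, N))  # "never" in ~400k docs            -> very low IDF (~0.9)
```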
I need to come up with a function that takes the document frequencies of an n-gram and all its component (n-k)-grams, and returns a more meaningful measure of how much the phrase distinguishes its parent document from the rest.
If I were dealing with probabilities, I would try interpolation or backoff models, something along the lines of the sketch below. But I am not sure what assumptions/intuitions those models rely on to perform well, so I don't know how well they would carry over to IDF scores.
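A naive interpolation-style blend of IDF scores might look like this. `lam` is just a placeholder mixing weight; I have no idea what a principled value would be:

```python
import math

def interpolated_idf(ngram_df, component_dfs, num_docs, lam=0.5):
    """Blend the n-gram's own IDF with the average IDF of its component
    (n-k)-grams, so that a rare phrase built entirely out of very common
    words gets discounted.

    ngram_df:       document frequency of the full n-gram
    component_dfs:  document frequencies of its component (n-k)-grams
    lam:            arbitrary mixing weight, would need tuning
    """
    own_idf = math.log(num_docs / ngram_df)
    component_idf = sum(math.log(num_docs / df) for df in component_dfs) / len(component_dfs)
    return lam * own_idf + (1 - lam) * component_idf
```

This at least pulls "you've never tried" down toward the low IDF of its unigrams, but the choice of `lam` and the simple averaging feel completely ad hoc to me.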
Does anybody have any better ideas?