How to sift idioms and set phrases apart from other common phrases using NLP techniques?
- by hippietrail
What techniques exist that can tell the difference between plain common phrases such as "to the" and "and the", and set phrases and idioms which have their own lexical meanings, such as "pick up", "fall in love", "red herring", "dead end"?
Are there techniques which are successful even without a dictionary, for instance statistical methods such as HMMs trained on large corpora?
Or are there heuristics, such as ignoring or down-weighting "promiscuous" words which can co-occur with just about any word, versus words which occur either alone or in a specific limited set of idiomatic phrases?
If there are such heuristics, how do we take into account set phrases and verbal phrases which do incorporate promiscuous words such as "up" in "beat up", "eat up", "sit up", "think up"?
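To make the "promiscuous word" idea concrete, here is a minimal sketch of one standard association measure, pointwise mutual information (PMI), computed over adjacent word pairs. It is only an illustration under stated assumptions, not a full idiom detector: the function name `pmi_scores`, the `min_count` cutoff, and the toy corpus are all made up for this example, and a real experiment would need a large corpus and proper tokenization. PMI divides the pair probability by the product of the individual word probabilities, so a word that appears next to almost anything is penalized, while a frequent word like "up" can in principle still score well in a pair that co-occurs more often than chance would predict.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration. Pairs seen fewer than
    min_count times are skipped, since PMI is unreliable for rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # log2 of observed pair probability over the probability
        # expected if the two words were independent
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus (made up); far too small for real conclusions.
tokens = ("the red herring led to the dead end and the clue "
          "led to the red herring and to the end").split()
scores = pmi_scores(tokens)

# Print pairs from strongest to weakest association.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On this toy input, "red herring" ranks above "to the" even though "to the" is the more frequent pair, which is the effect the heuristic is after. PMI alone says nothing about whether a phrase is non-compositional, so it would at best be a first filter.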
UPDATE
I've found an interesting paper online: Unsupervised Type and Token Identification of Idiomatic Expressions