Language parsing to find important words
- by Matt Huggins
I'm looking for some input and theory on how to approach a lexical topic.
Let's say I have a collection of strings, which may just be one sentence or potentially multiple sentences. I'd like to parse these strings and rip out the most important words, perhaps with a score that denotes how likely each word is to be important.
Let's look at a few examples of what I mean.
Example #1:
"I really want a Keurig, but I can't afford one!"
This is a very basic example, just one sentence. As a human, I can easily see that "Keurig" is the most important word here. Also, "afford" is relatively important, though it's clearly not the primary point of the sentence. The word "I" appears twice, but it is not important at all since it doesn't really tell us any information. I might expect to see a hash of word/score pairs something like this:
"Keurig" => 0.9
"afford" => 0.4
"want" => 0.2
"really" => 0.1
etc...
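
To make that output shape concrete, here's a minimal Python sketch (the language and the hand-rolled stopword list are just illustrative assumptions, not a proposed solution) that scores words by stopword-filtered frequency. It also shows exactly where plain frequency falls short, since it ranks "really", "want", "Keurig", and "afford" all the same:

```python
import re
from collections import Counter

# Hand-rolled stopword list, purely for illustration.
STOPWORDS = {"i", "a", "an", "the", "but", "one", "of", "my", "to",
             "can", "can't"}

def naive_scores(text):
    """Score words by relative frequency after dropping stopwords.

    This is only a placeholder for the 'importance' score I'm after:
    it has no idea that "Keurig" matters more than "want".
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    total = sum(counts.values()) or 1
    return {w: round(c / total, 2) for w, c in counts.most_common()}

print(naive_scores("I really want a Keurig, but I can't afford one!"))
# => {'really': 0.25, 'want': 0.25, 'keurig': 0.25, 'afford': 0.25}
```

The missing piece is whatever replaces that frequency score with something that actually reflects importance.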
Example #2:
"Just had one of the best swimming practices of my life. Hopefully I can maintain my times come the competition. If only I had remembered to take of my non-waterproof watch."
This example has multiple sentences, so there will be more important words throughout. Without repeating the scoring exercise from example #1, I would probably expect two or three really important words to come out of this: "swimming" (or "swimming practice"), "competition", and "watch" (or "waterproof watch" or "non-waterproof watch", depending on how the hyphen is handled).
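
For the multi-sentence case, I can see how a RAKE-style trick (splitting each sentence on stopwords so that runs of adjacent content words stay together as candidate phrases) would at least keep "non-waterproof watch" intact. Here's a rough Python sketch of what I mean, again with a made-up stopword list rather than anything principled:

```python
import re

# Another made-up stopword list, just enough to show the idea.
STOPWORDS = {"a", "an", "and", "but", "can", "come", "had", "i", "if",
             "just", "my", "of", "one", "only", "the", "to"}

def candidate_phrases(text):
    """Split each sentence on stopwords and punctuation so that runs of
    adjacent content words are kept together as candidate phrases."""
    phrases = []
    for sentence in re.split(r"[.!?]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z][a-z'-]*", sentence):
            if word in STOPWORDS:
                if current:
                    phrases.append(" ".join(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

text = ("Just had one of the best swimming practices of my life. "
        "Hopefully I can maintain my times come the competition. "
        "If only I had remembered to take off my non-waterproof watch.")
print(candidate_phrases(text))
# => ['best swimming practices', 'life', 'hopefully', 'maintain', 'times',
#     'competition', 'remembered', 'take off', 'non-waterproof watch']
```

That gets the phrase boundaries roughly right, but "hopefully" and "remembered" come out looking just as good as "competition", which brings me back to the scoring question.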
Given a couple of examples like this, how would you go about doing something similar? Are there any existing (open source) libraries or algorithms that already do this?