How to write efficient code for extracting noun phrases?
- by Arun Abraham
I am trying to extract phrases from POS-tagged text using rules such as the following:
1) NNP - NNP (- indicates followed by)
2) NNP - CC - NNP
3) VP - NP
etc.
I have written the code below. Can someone tell me how I can do this in a better way?
import java.util.ArrayList;
import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.ling.TaggedWord;

// documentPreprocessor and tagger are assumed to be initialised elsewhere.
List<String> nounPhrases = new ArrayList<String>();
for (List<HasWord> sentence : documentPreprocessor) {
    System.out.println(Sentence.listToString(sentence, false));
    List<TaggedWord> tSentence = tagger.tagSentence(sentence);
    // Remember the previous token so that adjacent tag pairs can be compared.
    String lastTag = null, lastWord = null;
    for (TaggedWord taggedWord : tSentence) {
        if (lastTag != null && taggedWord.tag().equalsIgnoreCase("NNP") && lastTag.equalsIgnoreCase("NNP")) {
            // Previous word first, so the phrase keeps the sentence's word order.
            nounPhrases.add(lastWord + " " + taggedWord.word());
        }
        lastTag = taggedWord.tag();
        lastWord = taggedWord.word();
    }
}
In the code above, I have only handled the NNP followed by NNP rule. How can I generalise it so that I can add other rules too? I know that there are libraries available for doing this, but I want to do it manually.
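
To make the goal concrete, here is a rough sketch of the kind of generalisation I have in mind: each rule is stored as a list of tags, and a window is slid over the tagged sentence to test every rule at every position. This is only an illustration under assumptions. TagPatternExtractor, Token, RULES, extract, matchesAt and join are hypothetical names of my own, and Token stands in for the tagger's TaggedWord so that the sketch runs without the tagger. It only covers rules made of word-level tags (rules 1 and 2); a rule like VP - NP uses phrase-level labels, which a POS tagger does not produce, so it would need a chunking step first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TagPatternExtractor {

    // Minimal stand-in for a tagged token (hypothetical; swap in TaggedWord).
    static final class Token {
        final String word;
        final String tag;
        Token(String word, String tag) { this.word = word; this.tag = tag; }
    }

    // Each rule is a sequence of POS tags that must appear consecutively.
    // Adding a rule means adding one more list here.
    static final List<List<String>> RULES = Arrays.asList(
        Arrays.asList("NNP", "NNP"),
        Arrays.asList("NNP", "CC", "NNP"));

    // Returns the words of every span whose tags match one of the rules.
    static List<String> extract(List<Token> sentence, List<List<String>> rules) {
        List<String> phrases = new ArrayList<>();
        for (int start = 0; start < sentence.size(); start++) {
            for (List<String> rule : rules) {
                if (matchesAt(sentence, start, rule)) {
                    phrases.add(join(sentence, start, rule.size()));
                }
            }
        }
        return phrases;
    }

    // True if the tags beginning at 'start' equal the rule, ignoring case.
    static boolean matchesAt(List<Token> sentence, int start, List<String> rule) {
        if (start + rule.size() > sentence.size()) {
            return false;
        }
        for (int i = 0; i < rule.size(); i++) {
            if (!rule.get(i).equalsIgnoreCase(sentence.get(start + i).tag)) {
                return false;
            }
        }
        return true;
    }

    // Joins 'length' words starting at 'start' with single spaces.
    static String join(List<Token> sentence, int start, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(sentence.get(start + i).word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> sentence = Arrays.asList(
            new Token("Barack", "NNP"), new Token("Obama", "NNP"),
            new Token("and", "CC"), new Token("Biden", "NNP"),
            new Token("spoke", "VBD"));
        // Prints [Barack Obama, Obama and Biden]
        System.out.println(extract(sentence, RULES));
    }
}
```

The cost is roughly (sentence length) x (number of rules) x (rule length), which seems acceptable for a handful of short rules. Is this a reasonable direction, or is there a cleaner way to express such rules?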