.NET Regular Expression to find actual words in text
- by Mehdi Anis
I am using VB .NET to write a program that reads the words from a supplied text file and counts how many times each word appears. I am using this regular expression:
Dim parser As New Regex("\w+")
It gives me almost 100% correct words, except when I have text like
"Ms Word App file name is word.exe." or "is this a C# statement If(ab?1,0) ?"
In such cases I get [word, exe] and [If, ab, 1, 0] as separate words. It would be nice (for my purpose) to receive word.exe and If(ab?1,0) as single words.
As far as I can tell, \w+ matches runs of letters, digits and underscores, so whitespace and every punctuation mark (not only sentence-terminating ones) ends a word.
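To show what I mean, here is a small console sketch that runs \w+ over the two sample sentences. The module name WordSplitDemo is just something I made up for the illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WordSplitDemo
    Sub Main()
        ' Same pattern as above: one or more word characters.
        Dim parser As New Regex("\w+")

        Dim samples As String() = {
            "Ms Word App file name is word.exe.",
            "is this a C# statement If(ab?1,0) ?"
        }

        For Each sample As String In samples
            ' Print each match in brackets so the split points are visible.
            For Each m As Match In parser.Matches(sample)
                Console.Write("[" & m.Value & "] ")
            Next
            Console.WriteLine()
        Next
    End Sub
End Module
```

The first line ends with [word] [exe] and the second with [If] [ab] [1] [0], because the dot, parentheses, question mark and comma are not word characters. The # in C# is dropped for the same reason.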
I want a similar regular expression that will not break a word at a punctuation mark unless that punctuation mark is at the end of the word. I think end-of-word can be defined by trailing whitespace or sentence-terminating punctuation (you may think of others). If you can suggest a regular expression (for VB .NET), that would be a great help.
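One possible reading of that rule, as a rough sketch and not a tested answer: start a word at a word character, keep taking non-whitespace characters, and leave off any run of sentence punctuation that sits right before whitespace or the end of the text. The group name word, the module name TokenSketch and the punctuation set [.,;:!?] are my own assumptions and would need adjusting for real input.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TokenSketch
    Sub Main()
        ' (?<word>\w\S*?)  a word character, then as few non-whitespace characters as possible
        ' [.,;:!?]*        optional trailing punctuation, kept out of the named group
        ' (?=\s|$)         the token must end at whitespace or at the end of the text
        Dim parser As New Regex("(?<word>\w\S*?)[.,;:!?]*(?=\s|$)")

        Dim samples As String() = {
            "Ms Word App file name is word.exe.",
            "is this a C# statement If(ab?1,0) ?"
        }

        For Each sample As String In samples
            For Each m As Match In parser.Matches(sample)
                ' Read the named group so the trailing punctuation is not included.
                Console.Write("[" & m.Groups("word").Value & "] ")
            Next
            Console.WriteLine()
        Next
    End Sub
End Module
```

With this pattern word.exe, C# and If(ab?1,0) each come out as one word, and the lone ? at the end is skipped because it contains no word character. A token that begins with punctuation, such as (If(ab?1,0), would lose its leading punctuation, which may or may not be what is wanted.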
Thanks.