.NET Regular Expression to find actual words in text

Posted by Mehdi Anis on Stack Overflow
Published on 2010-04-19T19:52:35Z Indexed on 2010/04/19 19:53 UTC

I am using VB .NET to write a program that reads the words from a supplied text file and counts how many times each word appears. I am using this regular expression:

Dim parser As New Regex("\w+")

It gets the words almost 100% right, except for text such as

"Ms Word App file name is word.exe." or "is this a c# statement If(a>b?1,0) ?"

In these cases I get [word, exe] and [If, a, b, 1, 0] as separate words. It would be nice (for my purpose) to receive word.exe and If(a>b?1,0) as single words.
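The behavior described above can be reproduced with a minimal sketch along these lines. The module name WordCountSketch, the function CountWords, and the use of an in-memory string instead of a text file are assumptions made only for illustration; the pattern is the one from the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module WordCountSketch
    ' Hypothetical helper: counts how often each \w+ match occurs in the text.
    ' Keys are compared case-sensitively (an assumption; the question does not say).
    Function CountWords(text As String) As Dictionary(Of String, Integer)
        Dim parser As New Regex("\w+")
        Dim counts As New Dictionary(Of String, Integer)
        For Each m As Match In parser.Matches(text)
            Dim current As Integer = 0
            ' TryGetValue leaves current at 0 when the word has not been seen yet.
            counts.TryGetValue(m.Value, current)
            counts(m.Value) = current + 1
        Next
        Return counts
    End Function

    Sub Main()
        ' Sample sentence taken from the question; a real program would read a file.
        For Each pair In CountWords("Ms Word App file name is word.exe.")
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value)
        Next
    End Sub
End Module
```

Because \w matches only letters, digits, and the underscore, the dot in word.exe ends one match and starts another, so "word" and "exe" are counted as separate entries.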

I guess \w+ treats whitespace, sentence-terminating punctuation, and other punctuation marks as the end of a word.

I want a similar regular expression that does not break a word at a punctuation mark unless that mark ends the word. I think the end of a word can be defined by trailing whitespace or sentence-terminating punctuation (you may think of others). If you can suggest a regular expression (for VB .NET), that would be a great help.
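One way to read that definition is sketched below; it is an illustration under stated assumptions, not a definitive answer. It assumes that "sentence-terminating punctuation" means the characters . , ; : ! ? and that a word ends where such characters are followed by whitespace or the end of the text. The names TokenSketch, TokenPattern, and ExtractTokens are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TokenSketch
    ' Assumed pattern: a token starts with a character that is neither whitespace
    ' nor one of . , ; : ! ? and then lazily extends over non-whitespace until
    ' only optional trailing punctuation remains before whitespace or end of text.
    Private ReadOnly TokenPattern As New Regex("[^\s.,;:!?]\S*?(?=[.,;:!?]*(?:\s|$))")

    ' Hypothetical helper: returns the tokens in the order they appear.
    Function ExtractTokens(text As String) As List(Of String)
        Dim tokens As New List(Of String)
        For Each m As Match In TokenPattern.Matches(text)
            tokens.Add(m.Value)
        Next
        Return tokens
    End Function

    Sub Main()
        ' Sample sentence taken from the question.
        For Each token In ExtractTokens("Ms Word App file name is word.exe.")
            Console.WriteLine(token)
        Next
    End Sub
End Module
```

With this pattern the inner dot of word.exe stays inside the token while the final dot is left out, and If(a>b?1,0) is kept whole. A known limitation of the sketch is that leading punctuation is dropped as well, so a token such as .NET would come out as NET.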

Thanks.

© Stack Overflow or respective owner
