johnv - Developer IT

Improve Efficiency for This Text Processing Code

- by johnv

I am writing a program that counts the words in a text file that is already lowercase and separated by spaces. I want to use a dictionary and count a word only if it appears in the dictionary. The problem is that the dictionary is quite large (~100,000 words) and each text document also has ~50,000 words. As a result, the code below is very slow (about 15 seconds to process one document on a quad-core i7 machine). I'm wondering whether there is something wrong with my code and whether its efficiency can be improved. Thanks so much for your help. Code below:

    public static string WordCount(string countInput)
    {
        string[] keywords = ReadDic(); /* read dictionary txt file */
        /* then reads the main text file */
        Dictionary<string, int> dict = ReadFile(countInput).Split(' ')
            .Select(c => c)
            .Where(c => keywords.Contains(c))
            .GroupBy(c => c)
            .Select(g => new { word = g.Key, count = g.Count() })
            .OrderBy(g => g.word)
            .ToDictionary(d => d.word, d => d.count);
        int s = dict.Sum(e => e.Value);
        string k = s.ToString();
        return k;
    }
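One likely cause of the slowdown is that `keywords.Contains(c)` on a `string[]` scans the whole array for every word, so roughly 50,000 words times 100,000 dictionary entries. A minimal sketch of one possible fix, assuming the dictionary fits in memory, is to load it into a `HashSet<string>` so each lookup is constant time on average. The class and method names (`WordCounter`, `CountDictionaryWords`) are hypothetical, and the sketch skips the grouping and sorting because only the total is returned in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Hypothetical helper: counts how many words in `text` appear in `keywords`.
    // HashSet<string>.Contains is an O(1) average-case lookup, whereas
    // Contains on a string[] is a linear scan repeated for every word.
    public static int CountDictionaryWords(string text, IEnumerable<string> keywords)
    {
        var keywordSet = new HashSet<string>(keywords, StringComparer.Ordinal);

        // Empty entries (from repeated spaces) are dropped here; the original
        // Split(' ') keeps them, which only matters if "" is in the dictionary.
        return text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Count(word => keywordSet.Contains(word));
    }
}
```

If many documents are processed against the same dictionary, building the set once and reusing it would avoid repeating the construction cost per document.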

Newbie Question: Read and Process a List of Text Files

- by johnv

I'm completely new to .NET and, as a first step, am trying to write a text processing program. The task is simple: I have 10,000 text files stored in one folder, and I want to read each one, store it as a string variable, run it through a series of functions, and save the final output to another folder. So far I can only enter the file path manually, like this (in VB.NET):

    Dim tRead As System.IO.StreamReader

    Public Function ReadFile() As String
        Dim EntireFile As String
        tRead = File.OpenText("c:\textexample\00001.txt")
        EntireFile = tRead.ReadToEnd
        Return EntireFile
    End Function

    Public Function Step1()
        .....
    End Function

    Public Function Step2()
        .....
    End Function

    ..............

I'm wondering, therefore, whether there is a way to automate this process. For example, I could store all the input file paths in a text file, read one entry at a time, and save the final output to a save path that is also listed in a text file. Any help is greatly appreciated.
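One way the loop over files might look is sketched below in C# (the same .NET classes are available from VB.NET). It assumes the inputs are `.txt` files in a single folder and that the chain of processing steps can be expressed as one string-to-string function; `BatchProcessor`, `ProcessFolder`, and the `transform` parameter are hypothetical names, not part of the original program.

```csharp
using System;
using System.IO;

public static class BatchProcessor
{
    // Hypothetical helper: reads every .txt file in inputDir, applies
    // `transform` to its contents, and writes the result under the same
    // file name in outputDir. Returns the number of files processed.
    public static int ProcessFolder(string inputDir, string outputDir, Func<string, string> transform)
    {
        Directory.CreateDirectory(outputDir); // does nothing if the folder already exists

        int processed = 0;
        foreach (string inputPath in Directory.EnumerateFiles(inputDir, "*.txt"))
        {
            string text = File.ReadAllText(inputPath);
            string result = transform(text);

            string outputPath = Path.Combine(outputDir, Path.GetFileName(inputPath));
            File.WriteAllText(outputPath, result);
            processed++;
        }
        return processed;
    }
}
```

Enumerating the folder directly avoids maintaining a separate text file of paths; if such a list is still wanted, `File.ReadLines` over that list could replace the `Directory.EnumerateFiles` call.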

Regex to delete HTML within <table> tags

- by johnv

I have an HTML document in .txt format containing multiple tables and other text, and I am trying to delete any HTML (anything within "<" and ">") if it is inside a table (between <table> and </table>). For example:

===================
other text <other HTML>
<table>
<b><u><i>bold underlined italic text</b></u></i>
</table>
other text <other HTML>
==============

The final output would be the following. Note that only the HTML between <table> and </table> is removed.

==============
other text <other HTML>
<table>
bold underlined italic text
</table>
other text <other HTML>
=============

Any help is greatly appreciated!
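A rough sketch of one regex-based approach in C#: first match each `<table>...</table>` block, then strip tags only from the text between the opening and closing tag. It assumes tables are not nested and that no `>` character appears inside attribute values; regular expressions are not a full HTML parser, so this is illustrative only. `TableTagStripper` and `StripTagsInsideTables` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class TableTagStripper
{
    // Captures the opening <table ...> tag, the body, and the closing tag.
    // Singleline lets '.' match newlines so multi-line tables are covered;
    // the lazy '.*?' stops at the first </table> (nested tables are not handled).
    private static readonly Regex TableBlock = new Regex(
        @"(<table\b[^>]*>)(.*?)(</table>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Any tag-like run: '<', one or more non-'>' characters, then '>'.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Removes tags inside each table body, leaving the <table> tags
    // themselves and everything outside the tables untouched.
    public static string StripTagsInsideTables(string html)
    {
        return TableBlock.Replace(html, m =>
            m.Groups[1].Value
            + AnyTag.Replace(m.Groups[2].Value, "")
            + m.Groups[3].Value);
    }
}
```

For documents with nested tables or irregular markup, an HTML parsing library would likely be more reliable than this pattern.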
