Word frequency tally script is too slow
Posted by Dave Jarvis on Stack Overflow
Published on 2011-01-07T15:49:22Z
Background
I created a script to count the frequency of words in a plain text file. The script performs the following steps:
- Count the frequency of words from a corpus.
- Retain only those corpus words that are also found in a dictionary.
- Create a comma-separated file of the frequencies.
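As a rough illustration of the counting step, a pipeline along these lines could produce the tallies. This is a sketch, not the script behind the pastebin link: the function name count_words is hypothetical, and it assumes that a word is a run of alphabetic characters and that case should be ignored. The output puts the count in the first column and the word in the second, which is the layout the awk call later in this post expects from frequency.txt.

```shell
# Hypothetical helper: print "count word" lines for a plain text file.
# Assumption: a word is a run of alphabetic characters, compared case-insensitively.
count_words() {
    tr -cs '[:alpha:]' '\n' < "$1" |   # put each word on its own line
        tr '[:upper:]' '[:lower:]' |   # fold everything to lower case
        grep -v '^$' |                 # drop the empty line left by leading separators
        sort | uniq -c                 # group identical words and count them
}
```

Something like `count_words corpus.txt > frequency.txt` would then write the tally, where corpus.txt is a placeholder name for the input file.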
The script is at: http://pastebin.com/VAZdeKXs
Problem
The following lines continually cycle through the dictionary to match words:
for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 "^$i\$" dictionary.txt >> corpus-lexicon.txt
done
It works, but it is slow: to remove the words that are not in the dictionary, the loop runs grep once per word found in the corpus, and each run scans the dictionary from the top. (The -m 1 option stops the scan at the first match, but a word that is not in the dictionary still forces a scan of the entire file.)
Question
How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of the words will not be in the dictionary.
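One direction that avoids the repeated scans, offered as a sketch rather than a drop-in replacement: read dictionary.txt once into an awk associative array, then test each word from frequency.txt with a single in-memory lookup. The function name filter_by_dictionary is hypothetical. The sketch assumes the dictionary holds exactly one word per line and that frequency.txt keeps the count in column 1 and the word in column 2, as the loop above implies.

```shell
# Hypothetical helper: print the words from a frequency file that also
# appear in a dictionary file, reading each file exactly once.
#   $1 = dictionary file (assumed: one word per line)
#   $2 = frequency file  (assumed: "count word" per line)
filter_by_dictionary() {
    awk '
        NR == FNR { dict[$0]; next }              # first file: remember every dictionary word
        ($2 != "") && ($2 in dict) { print $2 }   # second file: keep words seen in the dictionary
    ' "$1" "$2"
}

# Usage with the file names from the post:
#   filter_by_dictionary dictionary.txt frequency.txt > corpus-lexicon.txt
```

The array lookup takes roughly constant time per word, so the total work grows with the size of the two files rather than with their product. Unlike the grep version, words are compared as literal strings, not as regular expressions. Printing `$2 "," $1` instead of `$2` would emit word,count lines directly if the comma-separated output is wanted at this stage.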
Thank you!
© Stack Overflow or respective owner