joining text files with 600M+ lines
- by dnkb
I have two files huge.txt and small.txt. Huge has around 600M rows and it's 14Gigs, each line has four space separated words (tokens) and finally another space separated column with a number. Small has 150K rows with a size of ~3M, a space separated word and a number.
Both Files are sorted using the sort command, with no extra options. The words in both files may include apostrophes (') and dashes (-).
The deisred output would contain all columns from the huge.txt and the second column (the number) from small txt where the first word of huge.txt and the first word of small.txt match.
My attemtpts below failed miserably with the following error:
cat huge.txt|join -o 1.1 1.2 1.3 1.4 2.2 - small.txt > output.txt
join: memory exhausted
What I suspect is that the sorting order isn't right somehow even though the files are pre-sorted using:
sort -k1 huge.unsorted.txt > huge.txt
sort -k1 small.unsorted.txt > small.txt
Problems seem to appear around words that have apostrophes (') or dashes (-). I also tried dictinoary sorting using the -d option bumping into the same error at the end.
I see two ways out of this but don't know how to implement any of them.
1) Any tips how to sort the files in a way that the join command considers them to be sorted properly?
2) I was thinking of calculating MD5 or some other hashes of the strings to get rid of the apostrophes and dashes, and do the sorting and joining with the hashes instead of the strings themselves an dat the "translate" back the hashes to strings, but leave the numbers intact at the end of the lines. Since there would be only 150K hashes it's not that bad. What would be a good way to calculate individual hashes for each of the strings? Some AWK magic?
See file samples at the end.
Thank you!
sample of huge.txt
had stirred me to 46
had stirred my corruption 57
had stirred old emotions 55
had stirred something in 69
had stirred something within 40
sample of small.txt
caley 114881
calf 2757974
calfed 137861
calfee 71143
calflora 154624
calfskin 148347
calgary 9416465
calgon's 94846