Quickly compute added and removed lines
- by Philippe Marschall
I'm trying to compare two text files and compute how many lines were added and removed, basically what git diff --stat does. Bonus points for not having to store the entire file contents in memory.
The approach I currently have in mind is:
1. read each line of the old file
2. compute a hash (probably MD5 or SHA-1) for each line
3. store the hashes in a set
4. do the same for each line in the new file
5. every hash from the old file set that's missing in the new file set was removed
6. every hash from the new file set that's missing in the old file set was added
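To make the steps above concrete, here is a minimal sketch in Python. The function names line_hashes and count_changes are made up for illustration, and SHA-1 is just one of the two hashes mentioned. Only the digests are kept in memory, not the file contents. Note that this ignores line order and duplicates, so it will not always match what a real diff reports.

```python
import hashlib


def line_hashes(path):
    """Return the set of SHA-1 digests of all lines in the file (hypothetical helper)."""
    hashes = set()
    with open(path, "rb") as f:
        for line in f:  # iterates line by line, so the file is never fully loaded
            # Drop the line terminator so "\n" vs "\r\n" does not change the hash.
            hashes.add(hashlib.sha1(line.rstrip(b"\r\n")).digest())
    return hashes


def count_changes(old_path, new_path):
    """Return (added, removed) based on set differences of line hashes."""
    old = line_hashes(old_path)
    new = line_hashes(new_path)
    added = len(new - old)    # hashes only present in the new file
    removed = len(old - new)  # hashes only present in the old file
    return added, removed
```

Calling count_changes("old.txt", "new.txt") would return a tuple of the two counts; the file names here are placeholders.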
I'll probably want to exclude empty and whitespace-only lines. There is a small issue with duplicated lines. This can be solved either by additionally storing how often each hash appears, or by comparing the number of lines in the old and new files and adjusting either the added or removed count so that the numbers add up.
Do you see room for improvement or a better approach?
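The first option for duplicates (storing how often each hash appears) could look roughly like the sketch below, which also skips empty and whitespace-only lines. The names line_counts and count_changes_with_duplicates are hypothetical. It relies on collections.Counter, whose subtraction keeps only positive counts.

```python
import hashlib
from collections import Counter


def line_counts(path):
    """Count how often each line hash occurs, skipping blank lines (hypothetical helper)."""
    counts = Counter()
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue  # empty or whitespace-only line
            counts[hashlib.sha1(line.rstrip(b"\r\n")).digest()] += 1
    return counts


def count_changes_with_duplicates(old_path, new_path):
    """Return (added, removed), treating the lines of each file as a multiset."""
    old = line_counts(old_path)
    new = line_counts(new_path)
    # Counter subtraction drops zero and negative counts, so each side
    # only contains the surplus occurrences of a hash.
    added = sum((new - old).values())
    removed = sum((old - new).values())
    return added, removed
```

With this variant, a line that appears twice in the old file and once in the new file counts as one removed line instead of being ignored.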