Quickly compute added and removed lines
Posted by Philippe Marschall on Programmers
Published on 2012-10-29T10:11:42Z
Indexed on 2012/10/29 11:17 UTC
algorithms | file-handling
I'm trying to compare two text files and compute how many lines were added and removed, essentially what git diff --stat reports. Bonus points for not having to store the entire file contents in memory.
The approach I currently have in mind is:
- read each line of the old file
- compute a hash (probably MD5 or SHA-1) for each line
- store the hashes in a set
- do the same for each line in the new file
- every hash from the old file set that's missing in the new file set was removed
- every hash from the new file set that's missing in the old file set was added
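The steps above could be sketched roughly as follows in Python. This is only an illustration, assuming SHA-1 and lines supplied as bytes; the function names line_hashes and added_removed are made up for this example. Only the digests are kept in memory, not the line contents.

```python
import hashlib


def line_hashes(lines):
    """Return the set of SHA-1 digests for an iterable of byte lines."""
    hashes = set()
    for line in lines:
        # Drop the line terminator so "a\n" and "a\r\n" hash the same.
        hashes.add(hashlib.sha1(line.rstrip(b"\r\n")).digest())
    return hashes


def added_removed(old_lines, new_lines):
    """Return (added, removed) counts of distinct lines, ignoring duplicates."""
    old = line_hashes(old_lines)
    new = line_hashes(new_lines)
    added = len(new - old)    # hashes only present in the new file
    removed = len(old - new)  # hashes only present in the old file
    return added, removed
```

A file opened in binary mode is already an iterable of byte lines, so something like added_removed(open(old_path, "rb"), open(new_path, "rb")) would stream both files line by line.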
I'll probably want to exclude empty and whitespace-only lines. There is a small issue with duplicated lines. This can be solved either by additionally storing how often each hash appears, or by comparing the number of lines in the old and new files and adjusting the added or removed count so that the numbers add up.
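The first option for duplicates (storing how often each hash appears) might look like the sketch below, which also skips empty and whitespace-only lines. It assumes SHA-1 and byte lines again, and the names count_line_hashes and diff_stat are hypothetical. Because it ignores line order, a line that merely moved counts as unchanged, so the numbers will not always match git diff --stat.

```python
import hashlib
from collections import Counter


def count_line_hashes(lines, skip_blank=True):
    """Count how often each line digest occurs in an iterable of byte lines."""
    counts = Counter()
    for line in lines:
        # Skip empty and whitespace-only lines if requested.
        if skip_blank and not line.strip():
            continue
        counts[hashlib.sha1(line.rstrip(b"\r\n")).digest()] += 1
    return counts


def diff_stat(old_lines, new_lines):
    """Return (added, removed) line counts, taking duplicates into account."""
    old = count_line_hashes(old_lines)
    new = count_line_hashes(new_lines)
    # Counter subtraction keeps only positive differences.
    added = sum((new - old).values())
    removed = sum((old - new).values())
    return added, removed
```

For example, if the old file contains a line twice and the new file contains it once, this reports one removed line, whereas a plain set comparison would report none.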
Do you see room for improvements or a better approach?