Quickly compute added and removed lines

Posted by Philippe Marschall on Programmers
Published on 2012-10-29T10:11:42Z Indexed on 2012/10/29 11:17 UTC

I'm trying to compare two text files and compute how many lines were added and removed, essentially what git diff --stat reports. Bonus points for not having to hold the entire file contents in memory.

The approach I currently have in mind is:

  1. read each line of the old file
  2. compute a hash (probably MD5 or SHA-1) for each line
  3. store the hashes in a set
  4. do the same for each line in the new file
  5. every hash from the old file set that's missing in the new file set was removed
  6. every hash from the new file set that's missing in the old file set was added
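
To make the steps above concrete, here is a minimal sketch in Python. The function names line_hashes and count_changes are made up for illustration, and SHA-1 is just one of the two hash choices mentioned above. Only the digests are kept in memory (20 bytes per distinct line), not the line contents.

```python
import hashlib


def line_hashes(path):
    """Return the set of SHA-1 digests of the lines in the file at `path`.

    Hypothetical helper: reads the file line by line, so only the
    digests are held in memory, never the whole file.
    """
    hashes = set()
    with open(path, "rb") as f:
        for line in f:
            # Strip the line terminator so "\n" vs "\r\n" does not matter.
            hashes.add(hashlib.sha1(line.rstrip(b"\r\n")).digest())
    return hashes


def count_changes(old_path, new_path):
    """Return (added, removed) based on set differences of line hashes."""
    old = line_hashes(old_path)
    new = line_hashes(new_path)
    removed = len(old - new)  # hashes only present in the old file
    added = len(new - old)    # hashes only present in the new file
    return added, removed
```

Because this works on sets, repeated lines collapse into one entry and line order is ignored, so a line that merely moved counts as neither added nor removed.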

I'll probably want to exclude empty and whitespace-only lines. There is a small issue with duplicated lines: a plain set cannot tell how often a line occurs. This can be solved either by additionally storing how often each hash appears, or by comparing the number of lines in the old and new file and adjusting either the added or the removed count so that the numbers add up.
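
The first option, storing how often each hash appears, could look roughly like the following Python sketch. The names line_hash_counts and count_changes_with_duplicates are hypothetical, and skipping blank lines is the exclusion rule described above rather than anything git itself does.

```python
import hashlib
from collections import Counter


def line_hash_counts(path):
    """Count how often each line digest occurs, skipping blank lines."""
    counts = Counter()
    with open(path, "rb") as f:
        for line in f:
            # Ignore empty and whitespace-only lines.
            if not line.strip():
                continue
            digest = hashlib.sha1(line.rstrip(b"\r\n")).digest()
            counts[digest] += 1
    return counts


def count_changes_with_duplicates(old_path, new_path):
    """Return (added, removed), treating the files as multisets of lines."""
    old = line_hash_counts(old_path)
    new = line_hash_counts(new_path)
    # Counter subtraction keeps only positive counts, so each side
    # holds exactly the surplus occurrences of a line.
    removed = sum((old - new).values())
    added = sum((new - old).values())
    return added, removed
```

With counts, a line that appears twice in the old file and once in the new file contributes one removed line, which the plain set version would miss.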

Do you see room for improvements or a better approach?

© Programmers or respective owner
