Tracking unique versions of files with hashes
- by rwmnau
I'm going to be tracking different versions of potentially millions of different files, and my intent is to hash each one to determine whether I've already seen that particular version of the file. Currently I'm only using MD5 (the product is still in development, so it has never actually dealt with millions of files), which is probably not strong enough to avoid collisions.
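For concreteness, here's roughly what my current approach looks like (a minimal sketch using Python's hashlib; `md5_of_file` and the chunk size are just illustrative, not my production code):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Stream the file through MD5 so large files never have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()  # 128-bit digest as 32 hex characters
```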
However, here's my question: am I more likely to avoid collisions if I hash each file using two different algorithms and store both hashes (say, SHA-1 and MD5), or if I pick a single, longer hash (like SHA-256) and rely on that alone? I know option 1 gives 288 hash bits and option 2 only 256, but for the sake of the question, assume my two choices produce the same total number of hash bits.
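In code, the two options I'm weighing look something like this (again just a sketch; the function names are hypothetical):

```python
import hashlib

def fingerprint_two_hashes(data: bytes) -> str:
    """Option 1: two independent hashes stored side by side (128 + 160 = 288 bits)."""
    return hashlib.md5(data).hexdigest() + ":" + hashlib.sha1(data).hexdigest()

def fingerprint_single_hash(data: bytes) -> str:
    """Option 2: one longer hash (256 bits)."""
    return hashlib.sha256(data).hexdigest()
```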
Since I'm dealing with potentially millions of files (and multiple versions of those files over time), I'd like to do what I can to avoid collisions. However, CPU time isn't (completely) free, so I'm interested in how the community feels about the tradeoff: is adding more bits to my hash proportionally more expensive to compute, and are there any advantages to multiple different hashes as opposed to a single, longer hash, given an equal number of bits in both solutions?
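To put some numbers on the CPU side myself, I was planning to run a quick benchmark along these lines (a sketch; absolute timings will obviously vary by machine and input size):

```python
import hashlib
import timeit

data = b"x" * (10 * 1024 * 1024)  # 10 MB of dummy input

for name in ("md5", "sha1", "sha256"):
    # Average over 20 runs of hashing the same 10 MB buffer.
    t = timeit.timeit(lambda: hashlib.new(name, data).digest(), number=20)
    print(f"{name}: {t / 20 * 1000:.1f} ms per 10 MB")
```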