Secure, efficient, version-preserving, filename-hiding backup implemented in this way?
Posted by barrycarter on Super User
Published on 2011-01-16T18:42:30Z
Indexed on 2011/01/16 18:55 UTC
backup | incremental-backup
I tried writing a "perfect" backup program (described below), but ran into problems (also below). Is there an efficient, working implementation of this design?
Assumptions: you're backing up from 'local', which you own and which has limited disk space, to 'remote', which has infinite disk space but belongs to someone else, so you need encryption. Network bandwidth is finite.
'local' keeps a db of backed-up files with this data for each file:
filename, including full path
file's last modified time (mtime)
sha1sum of file's unencrypted contents
sha1sum of file's encrypted contents
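As a rough sketch of what that db could look like, here is one possible SQLite layout. The table name `files`, the column names, and the helper `open_db` are all made up for illustration; the post doesn't say which database or schema is actually used. The mtime is stored as an integer (nanoseconds) to avoid float comparison issues, which is also an assumption.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a hypothetical 'local' db with one row per backed-up file version."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS files (
            path        TEXT    NOT NULL,  -- filename, including full path
            mtime       INTEGER NOT NULL,  -- last modified time (assumed: nanoseconds)
            plain_sha1  TEXT    NOT NULL,  -- sha1sum of the unencrypted contents
            cipher_sha1 TEXT    NOT NULL,  -- sha1sum of the encrypted contents
            PRIMARY KEY (path, mtime)
        )
        """
    )
    # Lookups by plaintext hash are needed for the "same contents, other filename" check.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_plain_sha1 ON files (plain_sha1)")
    return conn
```

The composite primary key on (path, mtime) is what lets the same filename appear several times, once per version.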
Given a list of files to back up (some perhaps already backed up), the program runs 'find' and gets the full path and mtime for each file. (This is fairly efficient; by contrast, computing the sha1sum of every file would NOT be efficient.)
The program discards files whose filename and mtime are in 'local' db.
The program now computes the sha1sum of the unencrypted contents of each remaining file.
If the sha1sum matches one in 'local' db, we create a special entry in 'local' db that points this file/mtime to the file/mtime of the existing entry. Effectively, we're saying "we have a backup of this file's contents, but under another filename, so no need to back it up again".
For each remaining file, we encrypt the file, take the sha1sum of the encrypted contents, and rsync the encrypted file to a path on 'remote' derived from that sha1sum. Example: if the encrypted file's sha1sum is da39a3ee5e6b4b0d3255bfef95601890afd80709, we rsync it to /some/path/da/39/a3/da39a3ee5e6b4b0d3255bfef95601890afd80709 on 'remote'.
Once the step above succeeds, we add the file to the 'local' db.
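To make the steps above concrete, here is a minimal sketch of the planning part in Python. It is not the actual program: it uses `os.walk` in place of 'find', takes the db contents as two in-memory sets (`known_versions` and `known_hashes`, both hypothetical names), and leaves out encryption and the rsync call entirely. The function names are made up for illustration.

```python
import hashlib
import os
import stat


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex sha1 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def remote_path(cipher_sha1, root="/some/path"):
    """Map an encrypted file's sha1 to root/aa/bb/cc/<full sha1>, as in the example above."""
    return "/".join([root, cipher_sha1[0:2], cipher_sha1[2:4], cipher_sha1[4:6], cipher_sha1])


def plan_backup(roots, known_versions, known_hashes):
    """Split new file versions into aliases (contents already backed up) and uploads.

    known_versions: set of (path, mtime_ns) pairs already in the 'local' db (assumed shape).
    known_hashes:   set of plaintext sha1 hex digests already in the 'local' db.
    Returns two lists of (path, mtime_ns, plain_sha1) tuples.
    """
    aliases, uploads = [], []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue  # skip symlinks, sockets, etc. (a simplification)
                if (path, st.st_mtime_ns) in known_versions:
                    continue  # same filename and mtime: no hashing needed
                digest = sha1_of_file(path)
                if digest in known_hashes:
                    aliases.append((path, st.st_mtime_ns, digest))
                else:
                    uploads.append((path, st.st_mtime_ns, digest))
    return aliases, uploads
```

Only files that survive the filename/mtime check are ever hashed, and only the `uploads` list would go on to be encrypted and transferred; rows would be added to the db after each transfer succeeds.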
Note that we efficiently avoid computing sha1sums and encrypting unless absolutely necessary.
Note: I don't specify an encryption method; that would be the user's choice.
The problems:
We must encrypt and back up 'local' db regularly. However, 'local' db grows quickly, and rsync'ing encrypted files is inefficient, since a small change in 'local' db means a big change in its encrypted version.
We create a file on 'remote' for each file on 'local', which is ugly and excessive.
We query 'local' db frequently. Even with indexes, these queries are slow, since we often make one query per file. It would be nice to speed this up by batching queries or something similar.
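One way the batching might look, assuming the db is SQLite and has a table called `files` with `path` and `mtime` columns (both assumptions, the post doesn't name a database): load the whole scan into a temporary table and let a single join find the file versions not yet backed up. The function name `unseen_versions` and the temp table name `scan` are hypothetical.

```python
import sqlite3


def unseen_versions(conn, scanned):
    """Return the (path, mtime) pairs from `scanned` that have no row in `files`.

    Uses one bulk insert plus one join instead of one SELECT per file.
    """
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS scan (path TEXT, mtime INTEGER)")
    conn.execute("DELETE FROM scan")  # clear leftovers from any previous run
    conn.executemany("INSERT INTO scan (path, mtime) VALUES (?, ?)", scanned)
    return conn.execute(
        """
        SELECT s.path, s.mtime
        FROM scan AS s
        LEFT JOIN files AS f
               ON f.path = s.path AND f.mtime = s.mtime
        WHERE f.path IS NULL
        """
    ).fetchall()
```

Whether this actually beats per-file queries would need measuring on a real db; for a db small enough to fit in memory, reading all (path, mtime) pairs into a set once may be simpler still.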
Probably other problems that I've now forgotten.
© Super User or respective owner