Secure, efficient, version-preserving, filename-hiding backup implemented in this way?

Posted by barrycarter on Super User
Published on 2011-01-16T18:42:30Z

I tried writing a "perfect" backup program (below), but ran into problems (also below). Is there an efficient, working version of the following?

  • Assumptions: you're backing up from 'local', which you own and which has limited disk space, to 'remote', which has infinite disk space and belongs to someone else, so you need encryption. Network bandwidth is finite.

  • 'local' keeps a db of backed-up files w/ this data for each file:

    • filename, including full path

    • file's last modified time (mtime)

    • sha1sum of file's unencrypted contents

    • sha1sum of file's encrypted contents
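The per-file records above could be kept in SQLite. A minimal sketch (the table and column names are hypothetical, and a real program would use an on-disk file rather than ':memory:'):

```python
import sqlite3

# Hypothetical layout for the 'local' db; one row per backed-up file.
conn = sqlite3.connect(":memory:")  # e.g. "backup.db" in practice
conn.execute("""
    CREATE TABLE IF NOT EXISTS backed_up (
        path            TEXT PRIMARY KEY,   -- filename, including full path
        mtime           INTEGER NOT NULL,   -- file's last modified time
        plain_sha1      TEXT,               -- sha1sum of unencrypted contents
        encrypted_sha1  TEXT,               -- sha1sum of encrypted contents
        alias_of        TEXT                -- set when contents duplicate another path
    )
""")
conn.commit()
```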

  • Given a list of files to back up (some perhaps already backed up), the program runs 'find' and gets the full path/mtime for each file (this is fairly efficient; conversely, computing the sha1sum of every file up front would NOT be efficient)

  • The program discards files whose filename and mtime are in 'local' db.
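The scan-and-discard steps above might look like this (a sketch, assuming the 'local' db has been loaded into a path-to-mtime dict; only stat() is needed, no file contents are read):

```python
import os

def changed_files(roots, known):
    """Walk `roots` (like 'find') and yield (path, mtime) for files whose
    path/mtime pair is not already in `known`, a dict of path -> mtime
    built from the 'local' db."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    mtime = int(os.stat(path).st_mtime)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if known.get(path) != mtime:
                    yield path, mtime
```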

  • The program now computes the sha1sum of the (unencrypted) contents of each remaining file.

  • If the sha1sum matches one in 'local' db, we create a special entry in 'local' db that points this file/mtime to the file/mtime of the existing entry. Effectively, we're saying "we have a backup of this file's contents, but under another filename, so no need to back it up again".
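The duplicate-detection step could be sketched as follows (assuming a dict mapping plain-content sha1sums to already-backed-up paths, built from the 'local' db):

```python
import hashlib

def sha1_of(path):
    """sha1sum of a file's unencrypted contents, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def classify(path, by_plain_sha1):
    """Return ('alias', existing_path) if these contents are already backed
    up under another filename, else ('upload', digest)."""
    digest = sha1_of(path)
    if digest in by_plain_sha1:
        return "alias", by_plain_sha1[digest]
    return "upload", digest
```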

  • For each remaining file, we encrypt the file, take the sha1sum of the encrypted file's contents, and rsync the file to a path derived from that sha1sum. Example: if the file's encrypted sha1sum was da39a3ee5e6b4b0d3255bfef95601890afd80709, we'd rsync it to /some/path/da/39/a3/da39a3ee5e6b4b0d3255bfef95601890afd80709 on 'remote'.
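A sketch of the encrypt-and-upload step. The encryption method is the user's choice, so gpg here is only a stand-in (and its passphrase handling is elided); `remote_host` is a hypothetical parameter:

```python
import hashlib
import subprocess

def remote_path(encrypted_sha1, base="/some/path"):
    """Fan the encrypted sha1sum out into .../da/39/a3/<full-digest>."""
    d = encrypted_sha1
    return f"{base}/{d[0:2]}/{d[2:4]}/{d[4:6]}/{d}"

def backup_file(path, remote_host):
    # gpg is just a placeholder for the user's chosen encryption; a real
    # invocation would also need passphrase/key configuration.
    encrypted = subprocess.run(
        ["gpg", "--batch", "--symmetric", "--output", "-", path],
        check=True, capture_output=True).stdout
    digest = hashlib.sha1(encrypted).hexdigest()
    tmp = path + ".enc"
    with open(tmp, "wb") as f:
        f.write(encrypted)
    subprocess.run(["rsync", tmp, f"{remote_host}:{remote_path(digest)}"],
                   check=True)
    return digest
```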

  • Once the step above succeeds, we add the file to the 'local' db.

  • Note that we efficiently avoid computing sha1sums and encrypting unless absolutely necessary.

  • Note: I don't specify the encryption method: that would be the user's choice.

The problems:

  • We must encrypt and backup 'local' db regularly. However, 'local' db grows quickly and rsync'ing encrypted files is inefficient, since a small change in 'local' db means a big change in the encrypted version of 'local' db.
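The effect can be demonstrated with a toy sha256-based CBC-style chain (NOT a real cipher, just an illustration of the avalanche): changing one byte halfway through the plaintext leaves every later ciphertext byte different, so rsync's delta transfer gets no help from the unchanged second half.

```python
import hashlib

def toy_encrypt(data, key=b"k"):
    """Toy chained cipher: each 32-byte block is XORed with a keystream
    derived from the previous ciphertext block. Illustration only."""
    out, prev = bytearray(), b"\x00" * 32
    for i in range(0, len(data), 32):
        block = data[i:i + 32]
        stream = hashlib.sha256(key + prev).digest()
        cipher = bytes(x ^ y for x, y in zip(block, stream))
        out += cipher
        prev = cipher.ljust(32, b"\x00")
    return bytes(out)

a = toy_encrypt(b"x" * 4096)
b = toy_encrypt(b"x" * 2048 + b"y" + b"x" * 2047)  # one byte changed halfway
same = sum(x == y for x, y in zip(a, b))
# Ciphertexts match only up to the change point; everything after diverges.
```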

  • We create a file on 'remote' for each file on 'local', which is ugly and excessive.

  • We query 'local' db frequently. Even w/ indexes, these queries are slow, since we're often making one query for each file. Would be nice to speed this up by batching queries or something.
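One batching approach: pull the relevant columns out of the db once per scan and answer the per-file questions from an in-memory dict, instead of issuing one SELECT per file (table layout hypothetical):

```python
import sqlite3

# Hypothetical 'local' db; the point is the batching pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE backed_up (path TEXT PRIMARY KEY, mtime INTEGER)")
conn.executemany("INSERT INTO backed_up VALUES (?, ?)",
                 [("/a", 1), ("/b", 2), ("/c", 3)])

def filter_known(candidates):
    """Drop candidates whose (path, mtime) is already recorded, using one
    query for the whole scan rather than one query per file."""
    known = dict(conn.execute("SELECT path, mtime FROM backed_up"))
    return [(p, m) for p, m in candidates if known.get(p) != m]
```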

  • Probably other problems that I've now forgotten.

