Encrypted backups for paranoiacs

Needs and assumptions

Let's suppose we want to store backups of a server called SOURCE on N>=1 remote servers called REMOTE(n) (n between 1 and N), with the following needs:

  • confidentiality:
    • with respect to the aliens abducting SOURCE;
    • with respect to REMOTE(n): the files we need to back up should never be available as plaintext to the remote boxes.
  • tamper-proof: we should be able to check, at restoration time, that the backups stored on REMOTE(n) were generated by the legitimate SOURCE and have not been modified in the meantime.
  • automation: the backup process should run unattended.
  • resistance to subtle backup deletion methods: anyone gaining one-off root access to SOURCE (e.g. by abducting it) or having permanent root access to REMOTE(n) should not have the power to annihilate our whole backup scheme.
  • we do not want to give REMOTE(n) the ability to run arbitrary commands on SOURCE.
  • recoverability (duh!).

Let's also consider the following technical constraints:

  • SOURCE and REMOTE(n) boxes run some flavour of Unix operating system;
  • we have root access to SOURCE, but are only given a non-root user account on every REMOTE(n).

The following is an attempt to imagine and implement a software suite complying with these specifications.

Solutions

Confidentiality & tamper-proof property

Encrypting the data on disk should be sufficient to achieve confidentiality with respect to the aliens abducting SOURCE; make sure you read LinuxCryptoFS before choosing among the various available encryption solutions.

We now have to ensure that the files' plaintext never leaves SOURCE, by encrypting the backups locally. We will use GnuPG, as it is commonly available on Unix systems; we have to choose between the two cryptographic schemes it provides:

  • symmetric: the same passphrase is used to encrypt and decrypt data, and no public or private key is required; since we want the backup process to run unattended, this implies storing the passphrase as plaintext on SOURCE;
  • asymmetric: a public key is used for encryption and a private key for decryption. The latter is generally protected by a passphrase, although this is not mandatory; if it is, the passphrase has to be stored as plaintext on SOURCE. This is the way duplicity works when run with the "--encrypt-key" command-line option.
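
To make the difference concrete, here is a minimal sketch of both schemes on the command line; the file names, key ID and passphrase location are placeholders, and with GnuPG 2.x an unattended symmetric run may additionally need "--pinentry-mode loopback":

    # symmetric: the passphrase has to live in plaintext on SOURCE anyway
    gpg --batch --symmetric --passphrase-file /root/backup-passphrase \
        --output /var/backups/data.tar.gz.gpg /var/backups/data.tar.gz

    # asymmetric: only the public key is needed at encryption time
    gpg --batch --encrypt --recipient backups@example.org \
        --output /var/backups/data.tar.gz.gpg /var/backups/data.tar.gz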

There are at least three ways to achieve the tamper-proof property:

  • to generate a hash-code for each backup file with a one-way hash function, and store these checksums in plaintext in a "safe" place, of course not along with the corresponding backup files; such a "secure" place is hard to maintain, especially when, as in our case, it has to be online; as a consequence, we will avoid this solution;
  • to generate a digital signature for each backup file, using an asymmetric cryptographic system such as the one provided by GnuPG: a private key, optionally protected by a passphrase, is used to encrypt each backup file's hash-code; that's the way duplicity works when run with the "--sign-key" option;
  • to generate a hash-code for each backup file with a one-way hash function, then encrypt it with GnuPG's symmetric encryption.
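
As a rough illustration of the last two options (the key ID, file names and passphrase location are made up, with the same GnuPG 2.x caveat as above):

    # detached signature, the duplicity-like way
    gpg --batch --detach-sign --local-user backups@example.org \
        --passphrase-file /root/backup-passphrase \
        --output data.tar.gz.gpg.sig data.tar.gz.gpg

    # or: hash the backup file, then symmetrically encrypt the checksum
    sha256sum data.tar.gz.gpg | gpg --batch --symmetric \
        --passphrase-file /root/backup-passphrase > data.tar.gz.gpg.sha256.gpg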

The problem we have to solve is a key-management one.

A first reflex could be to choose a solution that avoids storing a private key on SOURCE; it's actually pointless, since anyone who gained access to such a key would also gain access to the plaintext of the files we want to back up. For the same reason, storing on SOURCE the passphrases used to encrypt our backups should not be considered a security flaw.

Let's consider the problem from another point of view: for obvious reasons, any key or passphrase needed to recover the backups should itself be backed up somewhere. A combination of human and computer memory seems optimal for this. Therefore, the asymmetric scheme looks like the most suitable one. What we are going to do is:

  • store on SOURCE the key-pair and the passphrase protecting the private key;
  • encrypt our backups with the public key;
  • generate a digital signature for each backup file with the private key and its passphrase;
  • store the digital signatures along with the backup files, on REMOTE(n);
  • back up the key-pair in a "secure" place;
  • remember the passphrase.
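
With duplicity, the automated part of this plan boils down to something like the following sketch (the key ID, paths and target URL are examples, not a prescription; duplicity reads the signing passphrase from the PASSPHRASE environment variable):

    # unattended run on SOURCE: encrypt with the public key, sign with the
    # private key, upload to REMOTE(1) over ssh
    PASSPHRASE="$(cat /root/backup-passphrase)" \
    duplicity --encrypt-key BACKUPKEYID --sign-key BACKUPKEYID \
        /etc scp://backup-user@remote1//srv/backups/destination1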

Misc. ideas about "incremental-ism"

"Incremental-ism" is generally useful to : (1) have the ability to revert to a previous state of the backup'd files, which is not possible using a basic synchronized mirror scheme ; (1) limit the used bandwidth ; (1) limit the needed disk space.

But... even if the clear-text changes only a little, its encrypted form changes a lot; therefore, the deltas have to be computed at the clear-text level.

That's why we need:

  • one layer to make deltas of the clear-text;
  • one layer to encrypt these deltas;
  • one layer to upload the encrypted files.

Making deltas generally implies keeping accessible to SOURCE either the old signatures (if granularity = file) or the whole old data (if granularity = line). We want to avoid the latter, for SOURCE disk-space reasons.

Methods to make deltas (and more):

  • backeupe - the good old tar-based version of this home-made tool: needs only the last backup's date;
  • duplicity: makes deltas with librsync-based tools; needs only the old signatures;
  • backeupe - if rewritten to use librsync-based tools, such as rdiffdir (bundled with duplicity): same functionality as duplicity, but... does not exist yet;
  • rdiff-backup in local->local mode: the whole old backups have to be stored on SOURCE;
  • rdiff-backup in local->remote mode: prevents us from encrypting the backups; disqualified.

As you can see, rdiff-backup is not suitable for us.
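
If you wanted to wire the three layers by hand instead of letting duplicity do it, it could look roughly like the sketch below; it assumes rdiffdir offers rdiff-like sig/delta subcommands, so check rdiffdir --help on your version before trusting it, and the paths are invented:

    # layer 1: clear-text delta against the signatures kept from the previous run
    rdiffdir delta /var/backups/previous.sig /data /var/tmp/data.delta
    rdiffdir sig /data /var/backups/previous.sig   # refresh signatures for next time

    # layer 2: encrypt and sign the delta locally
    gpg --batch --encrypt --sign --recipient backups@example.org \
        --local-user backups@example.org \
        --passphrase-file /root/backup-passphrase /var/tmp/data.delta

    # layer 3: upload only the encrypted file
    scp /var/tmp/data.delta.gpg backup-user@remote1:/srv/backups/destination1/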

Resistance to subtle backup deletion methods

We don't want to give SOURCE the power to delete the backups stored on REMOTE(n); that is, whether the aliens abduct SOURCE or someone breaks into it and gets root access, they should not be able to delete the backups stored on every REMOTE(n):

  • neither directly (e.g. because of a silly push-backup method with passwordless ssh keys giving SOURCE write access to the directory storing the backups on REMOTE(n));
  • nor indirectly (e.g. by deleting the files on SOURCE 5 minutes before a cronjob on REMOTE(n) has rsync update its mirror, which in this case means deleting the backups).

NB: in the next paragraphs, by "cleaning" I mean the process of deleting the old backups that are not needed anymore.

False good ideas

What won't work, and why :

  • as soon as SOURCE is allowed to "scp" files to REMOTE(n), only "chmod 400" can prevent an attacker from overwriting existing backups;
  • on REMOTE(n), move the backups to an inaccessible directory, or chown them to some other user: impossible with only a non-root account;
  • on REMOTE(n), a cronjob running "chmod 400": does not forbid "rm" if the account used on REMOTE(n) has full ssh access;
  • rdiff-backup in local->remote mode: no way to prevent the account from deleting the backups, since remote file deletion is strongly tied into the rdiff-backup process.

push-backup solutions

duplicity to a directory with sticky-bit & backups periodically chown'd on REMOTE(n)

  • we have to restrict the SOURCE->REMOTE(n) ssh-key to the scp, echo, ls and sftp-server commands (needed since duplicity >=0.4.2), using validate-duplicity.sh in the authorized_keys file on REMOTE(n):

    command="~/.ssh/validate-duplicity.sh",from="boum.org",no-port-forwarding,no-X11-forwarding,no-pty ssh-dss ...

  • this is not enough, since sftp allows deleting files; a workaround is to upload the files to a directory owned by root with the sticky bit set, then use balayette (git clone git://gaffer.ptitcanardnoir.org/balayette.git) to periodically chown the backups to root (and chmod them 640, so that duplicity is still able to read them) - see the sketch after this list; the backup directory tree should then be:

    • /srv/backups owned by root:backup-user, with permissions 0750
    • /srv/backups/destination1 owned by root:root, with permissions 1777
    • /srv/backups/destination2 owned by root:root, with permissions 1777
  • the validate-duplicity.sh, .ssh and authorized_keys files on REMOTE(n) should also be protected, so that SOURCE cannot overwrite or delete them:
    • $HOME owned by root:usergroup, with permissions 750 (to prevent $USER from renaming .ssh and creating a new one)
    • .ssh owned by root:root, with permissions 755
    • .ssh/authorized_keys owned by root:root, with permissions 644
    • .ssh/validate-duplicity.sh owned by root:root, with permissions 755
  • N=1 is enough... but can be increased, depending on how much you trust the REMOTE(n) boxes' security and reliability
  • disk space on SOURCE: a few megabytes when generating backups, that's all :)
  • house-cleaning on SOURCE: none, since duplicity manages this
  • house-cleaning on REMOTE(n): by a home-made cronjob
  • disk space on every REMOTE(n): 1 full + related deltas, x2 between a new full backup upload and the next house-cleaning
  • the three layers are combined in a single piece of software, which is well supported by backupninja
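
As a sketch, the REMOTE(n) side of this setup could be prepared as follows; it is run as root, the paths and user/group names match the ones above, and the cron line only illustrates what balayette automates:

    # destination directories: world-writable but sticky, so SOURCE can add
    # files yet cannot touch files owned by root
    install -d -o root -g backup-user -m 0750 /srv/backups
    install -d -o root -g root -m 1777 /srv/backups/destination1
    install -d -o root -g root -m 1777 /srv/backups/destination2

    # periodic take-over of freshly uploaded backups, keeping them group-readable
    # so that duplicity can still fetch them (e.g. a line in /etc/cron.d/backups)
    0 * * * * root find /srv/backups -type f -user backup-user -exec chown root:backup-user {} \; -exec chmod 640 {} \;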

protected rsync

  • it's possible to tie the SOURCE->REMOTE(n) ssh-key to a single rsync command (see the sketch after this list) including:
    • '--max-delete=0' to ensure that no file is deleted
    • '--ignore-existing' to tell rsync not to update files that already exist on REMOTE(n)
  • any local->local incremental encrypted backup solution, such as duplicity or backeupe, is used on SOURCE to generate the backups and to keep only the latest needed ones
  • N=1 is enough... but can be increased, depending on how much you trust the REMOTE(n) boxes' security
  • house-cleaning on REMOTE(n): by a home-made cronjob
  • disk space on every REMOTE(n): 1 full + related deltas, x2 between a new full backup upload and the next house-cleaning
  • disk space on SOURCE: 1 full + related deltas, x2 when creating backups
  • robust
  • periodically running "chmod 400" on the backups on REMOTE(n) won't hurt

pull-backup solutions

The idea is: any local->local incremental encrypted backup solution, such as duplicity or backeupe, is used on SOURCE to generate the backups and to keep only the latest needed ones. These files are then downloaded by REMOTE(n), using one of the various solutions described below.

The problem inherent to this method is that, at least from time to time, one full backup and the related deltas have to be stored simultaneously on SOURCE. Depending on how SOURCE and REMOTE(n) synchronize themselves (or not), this disk-space requirement can be doubled and/or permanent.

rdiff-backup

  • N=1 is enough... but can be increased, depending on how much you trust the REMOTE(n) boxes
  • it seems possible to restrict the SOURCE->REMOTE(n) ssh-key to one single command
  • disk space on REMOTE(n) >= 2 x (1 full + related deltas), i.e. huge
  • house-cleaning on REMOTE(n): just run "rdiff-backup --remove-older-than" after the backup process
  • synchronizing the house-cleaning of old backups between SOURCE and REMOTE(n): making incremental backups of incremental backups can cause headaches; disqualified.

read-only rsync server

Notes:

  • it's possible to restrict an rsync server to a read-only chroot, and to talk to it over ssh (a sketch follows at the end of this section)
  • an ssh account is needed; how do we limit it to just this? PAM?
  • how secure is an rsync server? chroot evasion? past vulnerabilities?

with protected rsync client

  • N=1 is enough... but can be increased, depending on how much you trust the REMOTE(n) boxes
  • the "subtle backup deletion" protection is achieved in the same way as described for the "protected rsync" push method
  • house-cleaning on REMOTE(n): by a home-made cronjob
  • disk space on each REMOTE(n) = 1 full + related deltas, x2 when creating backups

with redundancy, as documented on http://docs.indymedia.org/view/Sysadmin/PullBackupsForParanoiacs

  • N>=3 is necessary
  • house-cleaning on REMOTE(n): none (rsync does it)
  • disk space on each REMOTE(n) = 1 full + related deltas
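
Here is a hedged sketch of what the SOURCE side of the read-only rsync setup could look like; the module name, paths and account are invented, "use chroot" would require the daemon to run with root privileges (hence it is left out here), and the daemon-over-ssh syntax is worth double-checking against your rsync version:

    # /etc/rsyncd.conf on SOURCE: a single read-only module exposing the
    # locally generated, already-encrypted backups
    [backups]
        path = /var/backups/duplicity
        read only = yes

    # ~/.ssh/authorized_keys on SOURCE for the key used by REMOTE(n): only
    # allow spawning a single-use rsync daemon, no shell
    command="rsync --server --daemon .",no-port-forwarding,no-pty ssh-dss ...

    # pull command on REMOTE(n); without --delete, nothing already downloaded
    # is ever removed
    rsync -az -e ssh backup-view@source.example.org::backups/ /srv/backups/source1/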

read-only FTP server

  • a virtual user is possible, i.e. REMOTE(n) does not need access to a shell account on SOURCE
  • (pull) download process: a home-made script downloads the backups newer than the latest it already has (see the sketch after this list)
  • house-cleaning on REMOTE(n): by a home-made cronjob
  • N=1 is enough... but can be increased, depending on how much you trust the REMOTE(n) boxes
  • disk space on each REMOTE(n) = 1 full + related deltas, x2 when downloading backups
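
The download script on REMOTE(n) could be as simple as the following sketch; lftp is only one option among others, and the virtual user, host and paths are placeholders:

    #!/bin/sh
    # fetch from SOURCE's read-only FTP area only the backup files we do not
    # have yet (or that are newer), never deleting anything locally
    lftp -u backup-view,SECRET \
        -e "mirror --only-newer /backups /srv/backups/source1; quit" \
        ftp.source.example.org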

Discussion

I've personally chosen the duplicity solution, which is the only one that requires almost no disk space on SOURCE. You can choose whatever you want, depending on your own constraints :)

Appendix

Generate new full backups from time to time

The easiest way is to have two backupninja jobs:

  • the more frequent one (something like once a day) does the incremental backups;
  • a far less frequent one (something like once every two months) performs a new full backup.

When a new full backup is performed using duplicity, the regular job will start performing incremental backups against it, and house-cleaning will happen on the backup host.

Alternative: duplicity 0.4.4 has a new option, --full-if-older-than=, that does exactly what we want.
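
With that option, a single unattended job is enough; for instance (key ID, schedule, paths and target are illustrative):

    # run daily from cron or backupninja; duplicity decides by itself when a
    # new full backup is due (here: when the last full is older than 2 months)
    duplicity --full-if-older-than 2M \
        --encrypt-key BACKUPKEYID --sign-key BACKUPKEYID \
        /etc scp://backup-user@remote1//srv/backups/destination1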

house-cleaning

Lots of the above solutions need a way to limit the necessary disk space, by deleting all but the most recent backups from SOURCE and/or REMOTE(n). How? Well, balayette (git clone git://gaffer.ptitcanardnoir.org/balayette.git) should do the job and, as a bonus, is able to chmod and/or chown the backup files, as needed by the duplicity setup described above.
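
If you would rather not depend on balayette for the REMOTE(n) side, a crude cron-based sketch could look like this; the retention must stay longer than the interval between two full backups, otherwise deltas that are still needed would be deleted:

    # hypothetical /etc/cron.d entry on REMOTE(n): drop backup files older than
    # ~70 days (assuming a new full backup is made every 2 months)
    30 3 * * * root find /srv/backups/destination1 -type f -mtime +70 -delete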

How to limit bandwidth usage?