How to use S3 for backups

I was recently bored, so like any normal person I decided to write a custom backup solution for S3. I do not claim that this is the best solution (in fact, I know it is not), but I think it is interesting to talk about the thoughts that went into it.

If you just want the script, you can find it here.

Requirements

Here are the things I wanted to achieve with this solution:

  • files are locally encrypted with my own keyfile
  • use cheapest possible storage
  • only upload files when they have actually changed
  • compress files as much as possible
  • allow restoring individual files, at least to a degree

Implementation – Terraform

Before we can start backing up things to S3, we first have to actually have an S3 bucket. The Terraform configuration used can be seen in the gist linked above; I just want to go into some of the specific settings in more detail.

resource "aws_s3_bucket" "bucket" {
  bucket = "xyz"
  acl    = "private"
  region = "xyz"
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
  versioning {
    enabled = true
  }
  lifecycle_rule {
    enabled = true

    noncurrent_version_expiration {
      days = 180
    }
  }
  tags = {
    project = "backup"
  }
}

For the most part it should be self-explanatory. I believe there is never a good reason not to enable encryption at rest on AWS.

The first interesting part is versioning. I decided very early on that I would leave versioning up to AWS instead of implementing it myself. The reason is that this simplifies checking for changes a lot: instead of having to find the latest version I uploaded, I can just check against exactly the same filename. The downside is that this is more expensive if you have data that changes daily. If you keep 180 daily versions of a 1 GB file, you pay for 180 GB, which at roughly 0.004 USD per GB-month on Glacier is a whopping 0.72 USD a month. I also do not have to handle deletion of old versions myself; that is done by the lifecycle rule.

resource "aws_s3_bucket_public_access_block" "public_access_block" {
  bucket = aws_s3_bucket.bucket.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

If you do not want your data to be public, you should always enable the public_access_block. Basically the problem it solves is this: if you have a private bucket policy, it can still be overridden on a per-object level by setting an individual object to public-read. A lot of people accidentally set this and, as a result, expose data they do not want exposed. If you set public access block and try to put or update an object in a way that would make it accessible to the public, that request will get a 403 instead.
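
If you want to convince yourself of that behaviour, a quick boto3 sketch like the following should show it (bucket name and key are placeholders, and it assumes credentials are configured):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

try:
    # with block_public_acls enabled, a PUT that would make the object
    # public is rejected with a 403 before it ever lands in the bucket
    s3.put_object(
        Bucket='xyz',
        Key='should-not-be-public.txt',
        Body=b'test',
        ACL='public-read',
    )
except ClientError as err:
    print(err.response['Error']['Code'])  # AccessDenied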

Implementation – Python

I will not go through the code line by line, as that would be superfluous, but instead focus on a couple of key parts that I found interesting or that had to be done differently than expected.

Detecting file changes

The first important question is how file changes are actually detected. The answer, of course, is to use an MD5 hash. The first thing to realize here though is that if you use AES in any secure mode it will use an initialization vector (IV). This is basically a salt and means that if you encrypt the same file multiple times, the resulting encrypted file will have a different hash every time. That in turn means you have to get and store the MD5 hash before the file is encrypted.
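
A minimal sketch of that hashing step (the function name is my own, not necessarily what the script uses):

import hashlib

# hash the *unencrypted* archive in chunks, so large files do not have to
# fit into memory; the digest is what gets compared on the next run
def md5_of_file(path, chunk_size=1024 * 1024):
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()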

Another note: if you use the command line tar and gzip, you have to do this:

tar cf - /my/dir | gzip --no-name > /myfile.tgz

If you use tar with the z option, the resulting file will have a different hash every time, since by default gzip writes a last-modified timestamp into the file.
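
The same thing can be done from Python without shelling out; passing mtime=0 and an empty filename to GzipFile has the same effect as --no-name (a sketch, paths are placeholders):

import gzip
import tarfile

def deterministic_tgz(src_dir, out_path):
    with open(out_path, 'wb') as raw:
        # mtime=0 and an empty filename keep the gzip header constant,
        # which is what `gzip --no-name` does on the command line
        with gzip.GzipFile(filename='', mode='wb', fileobj=raw, mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode='w') as tar:
                tar.add(src_dir)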

Storing MD5 hashes on AWS

There are three main ways you can store hashes when using S3:

  • rely on the ETag
  • S3 metadata
  • DynamoDB

On files that are small enough that a multipart upload is not needed, the ETag of an S3 object is already the MD5 hash. For multipart uploads that is not the case though; the ETag is instead derived from the MD5 hashes of the individual parts, so you would need to reconstruct the correct hash yourself.
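
If you do want to go down that road, the commonly documented scheme for reconstructing a multipart ETag looks roughly like this; note that it only works if you know the exact part size used for the upload (the 8 MiB below is just an assumption):

import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    # S3 builds the multipart ETag from the MD5 of the concatenated
    # per-part MD5 digests, followed by "-<number of parts>"
    part_digests = []
    with open(path, 'rb') as f:
        while chunk := f.read(part_size):
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b''.join(part_digests)).hexdigest()
    return f'{combined}-{len(part_digests)}'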

Using custom metadata makes this a little easier and cheaper. By just doing

s3.meta.client.upload_file(
        Bucket    = bucket_name,
        Config    = transfer_config,
        Filename  = path,
        Key       = s3_key,
        ExtraArgs = {
            'ACL': 'private',
            'Metadata': { 'md5': md5_local_filesystem },
            'StorageClass': 'GLACIER'
        }
    )

you can store the MD5 hash in custom metadata. This metadata can easily be retrieved with a HEAD request, for example:

aws s3api head-object --bucket $bucketname --key $keyname --query 'Metadata.md5'
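
The same lookup is straightforward from Python with boto3 (bucket and key below are placeholders):

import boto3

s3 = boto3.client('s3')

# metadata keys come back lowercased, so 'md5' matches what was uploaded
response = s3.head_object(Bucket='xyz', Key='my/backup.tgz.aes')
stored_md5 = response['Metadata'].get('md5')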

The downside of using S3 for this at all is that S3 requests are relatively expensive, especially on GLACIER. As such, it is much cheaper to store the hashes in a DynamoDB table instead. This is very easy, since we can use the guaranteed-unique S3 key as the DynamoDB partition key. Both storing and retrieving this information is then very simple:

client.get_item(
        TableName       = dynamodb_table_name,
        Key             = {
            's3_key': {
                'S': s3_key
            }
        },
        AttributesToGet = [ 'md5' ]
    )

client.put_item(
        TableName = dynamodb_table_name,
        Item      = {
            's3_key': {
                'S': s3_key
            },
            'md5': {
                'S': md5_local_filesystem
            }
        }
    )
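
Putting the pieces together, the decision whether to upload at all is just a comparison of the stored and the freshly computed hash. A rough sketch of that logic, where md5_of_file() is the hashing sketch from earlier and upload_to_s3() is a placeholder for the actual upload:

import boto3

client = boto3.client('dynamodb')

def backup_if_changed(path, s3_key, dynamodb_table_name):
    response = client.get_item(
        TableName       = dynamodb_table_name,
        Key             = { 's3_key': { 'S': s3_key } },
        AttributesToGet = [ 'md5' ]
    )
    stored  = response.get('Item', {}).get('md5', {}).get('S')
    current = md5_of_file(path)
    if stored == current:
        return  # unchanged since the last run, skip the upload
    upload_to_s3(path, s3_key)
    client.put_item(
        TableName = dynamodb_table_name,
        Item      = {
            's3_key': { 'S': s3_key },
            'md5':    { 'S': current }
        }
    )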

Encryption

I played around with implementing this myself; however, crypto is hard to get right and easy to get wrong. Furthermore, I wanted to make sure I could decrypt the data everywhere, given that I still had the key, so I wanted to use standard command line tools. I settled on openssl as it makes it very simple to use a key file:

openssl enc -aes-256-cbc -pbkdf2 -pass file:/home/myuser/keyfile -in /my/file.tar.bz2 -out /my/file.tar.bz2.aes
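
From Python this can simply be shelled out to; a sketch of such a wrapper, with the keyfile path as a placeholder:

import subprocess

def encrypt_file(path, keyfile='/home/myuser/keyfile'):
    # produces <path>.aes next to the original, mirroring the openssl
    # invocation above; check=True raises if openssl exits non-zero
    subprocess.run(
        [
            'openssl', 'enc', '-aes-256-cbc', '-pbkdf2',
            '-pass', f'file:{keyfile}',
            '-in', path,
            '-out', path + '.aes',
        ],
        check=True,
    )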

And it is very simple to create a strong keyfile, either using Python:

import os

with open(keyfile, 'wb') as f:
    f.write(os.urandom(4096))

or bash:

dd if=/dev/urandom of=/home/myuser/keyfile bs=1K count=4

Differential backups

I did not implement differential backups; five hours were not enough time. Also, I have no idea how to do that with encryption. So I settled on a slightly buggy solution: for every folder that I want to back up, I can define a 'depth'. Given this folder structure:

/a/b/c/w
/a/b/d/x
/a/b/e/y
/a/b/e/z
/a/f

The following archives will be created, based on the depth setting:

depth=0: a
depth=1: b, f
depth=2: c, d, e
depth=3: w, x, y, z

The astute reader might notice that anything bigger than depth=1 will not back up f. That is due to this very concise, some might call it lazy, solution:

glob_pattern = path + '/*' * depth
return glob.glob(glob_pattern)

A proper solution here would be to write a recursive function that descends at most depth levels, but stops early whenever there are no subfolders left to go into.
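
Such a function is not much longer. A sketch of what it could look like (the name is mine):

import os

def collect_backup_paths(path, depth):
    # stop descending when we run out of depth or hit a plain file
    if depth == 0 or not os.path.isdir(path):
        return [path]
    entries = [os.path.join(path, name) for name in sorted(os.listdir(path))]
    # a directory with no children still has to be backed up itself
    if not entries:
        return [path]
    result = []
    for entry in entries:
        result.extend(collect_backup_paths(entry, depth - 1))
    return result

With the example tree above and depth=2, this returns c, d, e and, crucially, also f.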
