Hacker News new | ask | show | jobs
by derefr 2341 days ago
A question for HN: what filesystem and/or block-device abstraction layer would you use on a database server, if you wanted to perform scheduled incremental backups using filesystem-level consistent snapshotting and differential snapshot shipping to object storage, instead of using the DBMS’s own replication layer to achieve this effect? (I.e. you want disaster recovery, not high availability.)

Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)? It it proprietary, or is it just several FOSS technologies glued together?

My naive guess would be that the cloud hosts are either using ZFS volumes, or LVM LVs (which do have incremental snapshot capability, if the disk is created in a thin pool) under iSCSI. (Or they’re relying on whatever point-solution VMware et al sold them.)

If you control the filesystem layer (i.e. you don’t need to be filesystem-agnostic), would Btrfs snapshots be better for this same use-case?

9 comments

AWS and GCP most likely use their own proprietary stuff. Various storage systems (such as Netapp) are able to provide snapshots at the storage system level, and if you're interested in something open source, a Ceph cluster can also provide you snapshottable block devices; whether it's a good idea for a database is another question.

Filesystem snapshots are a legitimate way of backing up databases, but it's not quite as simple as just taking a snapshot. For PostgreSQL for example you will still need to call pg_start_backup() and ensure your WAL archives are properly stored in your object storage system for point-in-time recovery. Without the database-specific precautions, your snapshots will still be crash-consistent and most likely usable in some manner, but not quite proper backups.

Using BTRFS or ZFS as the database filesystem has its own footguns. For example, the default record size of ZFS datasets doesn't match the block size of most databases, so if you forget to take that into account, you'll very likely see rather terrible performance.

If the database is PostgreSQL, I would strongly advise about forgetting about filesystem snapshots and instead using streamed backups (if on premises use barman, if in cloud WAL-E or WAL-G (never used it but looks like improvement over WAL-E).

This gives you backup with a replay value, so you can restore at any point in time. You can also use such backup for setting up replication. There's still a daily backup which is there to speed up recovery and increase resiliency. Those backups don't really put much load on the database, but if that's a concern you can back up the replica (which is what cloud providers or at least AWS is doing).

As for ZFS, out of the box ZFS is not a good file system for databases, although you can get a good performance after tuning. You for example want to configure it to have block sizes alizened with database blocks, configure ZIL, perhaps changing block hashing algorithms (although I think current default should be fast).

As for your question how are cloud providers are doing it, most of us can speculate. To me it looks like standard RDS instances are simply on EBS (which are utilizing S3). In Aurora they skipped EBS and implemented another database storage directly.

It seems like the backups are performed in traditional way though.

I do not think it would be a good idea to use file system level snapshotting for backing up a database. The database "knows better" about its internals, and can give more guarantees about the consistency of its data. I would trust a filesystem-levdl backup only as a last resort.
It is possible to put database in state that is "ready" for snapshot, pushing changes to disk and sort of freezing I/O during snapshot.
this is generally not a matter of concern for a copy on write filesystem like zfs, since it's not possible for the file to be in an "in between" state. If a write were in progress, the filesystem would still be pointing to the previous state. Only when the data is written to disk is the pointer moved to the new location.
It very much is a concern. ZFS has no knowledge about the internals of a database, which parts of a file are related to each other etc.
DBMSes always keep their database in the file system in a consistent state to be able to recover from system crashes. Taking a file system snapshot is equivalent to pulling the power on the database server in terms of data recovery, but databases are designed to support this.
As do filesystems. Yet I've seen anyone argue that cutting the power is the recommended way of doing backups.

In fact the opposite, make sure to use an UPS just so that you can shutdown cleanly in the unfortunate event.

For example: https://blogs.oracle.com/paulie/backing-up-mysql-using-zfs-s...

It matters when various writes to files need to be ordered, and the DB processes is keeping some of the consistency part in memory.
It is a matter of concern if said database systems leaves its filesystem contents in an inconsistent state at any point. ZFS, BTRFS, and others can only keep consistent what they have control over.
I also think database specific backup makes more sense.

Some people recommend filesystem snapshotting but wouldn't that make recovery a slow process because you have to load up the entire database even if you just wanted to look up on data of a small table?

Maybe backing up only small tables as SQL dumps while keeping a file system snapshot would be a good compromise.

Create a LVM thinpool, create a thin LV in that, format with XFS, put the database on top.

Each time you want to a backup, create a CoW snapshot of the thin LV, then mount it somewhere and run the backup.

The "main" thin LV should be happily chugging along independently when you are doing that.

And all this is stable proven technology available about anywhere (eq. RHEL 7+).

My understanding is that performance on LVM drops dramatically after the first snapshot due to the way it handles CoW (synchronous writes on top of your async write). Is that no longer true, or only in certain circumstances?

It seems as though the way to go would be to take a 'snapshot', back it up, and then delete it immediately; is that right?

https://www.nikhef.nl/~dennisvd/lvmcrap.html

I think this complaint applies to the original LVM2 snapshots, not the new thin ones.
>> Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)?

As far as I know AWS does not use SANs because they consider it as anti-pattern. Most backups land on S3 because of reliability and price.

SAN and s3 are different beasts.

EBS is very much a SAN, if you read the docs, the Nitro HBA Controllers have dedicated bandwidth allocation for doing just EBS.

As there is a dedicated network for just servicing block storage, that sounds suspiciously like a Storage Area Network to me.

S3 for backup makes lots of sense, its ubiqutous, reliable and smeared over lots of regions. It also works well with large files. Its also orders of magnitude cheaper than EBS to run.

Sure thing. I was referring to the lack of SAN in the context of backups. Yes, EBS is a SAN in that sense.
So how is S3 implemented? Does it reuse any publicly available open source component?
I don’t think they’ve published anything specifically on S3’s architecture (someone please correct me if I’m wrong, I last looked into this a long time ago), but

1. they came out with S3 soon after coming out with their Dynamo paper (before releasing DynamoDB, even); and

2. there’s a good constructive proof, as a studyable FOSS system, for how to build object storage on top of a Dynamo architecture, in the form of Riak CS (object storage) which is built atop Riak KV (a Dynamo impl.) Riak CS seems to make pretty much the same set of guarantees (in terms of time/space complexity of operations, possible durability numbers per scaled number of copies, etc.) that S3 does, so it’s a fair guess that they’re similarly-architected systems.

It is a closed source project that has many components. I am not aware if any of those are opensource.
Assuming you can afford 2 machines this setup works pretty well for me.

Primary DB-Server -> XFS on LVM with a LVM caching SSD

Secondary (write-only) mirror DB-Server -> ZFS

The DB is replicating automatically to the secondary server by the database internal replication features. At the secondary I am then able to lock the DB temporarily, doing a ZFS snapshot and maybe could do a ZFS send/receive afterwards without affecting the primary server.

ZFS is great for doing snapshots and archiving of huge amount of data, but it's very very bad for production databases in terms of performance. Most database aren't designed to deal with the CoW feature of ZFS, which leads to a very bad write performance and database fragmentation in the end.

Zfs

Btrfs

See various "Private Cloud" Linux distributions for implementation examples of this. Such as Proxmox which does it out of the box on ZFS and soon on btrfs too.

xfs + dm-snapshot / lvm snapshots. Very fast and very reliable.
AWS, and possibly GCP only allow 1:1 mapping of volumes (publicly, I know AWS allow it under the hood. )

Which makes synchronising snaphots a lot easier (and caching too, but thats another thing entirely.)

They are treated as block storage, so on the outside don't have to worry about what filesystem is running on it. (in practice they have to be a bit aware, so that they don't snapshot unbootable or dirty images, but I assume thats mostly handled by an OS plugin)

TL;DR:

AWS et al snapshots are at the block level. Linux has poorly documented primitives for this.

If you put your VM images on a Filesystem provided by ZFS or BTRFS then you can snapshot your images, without having to buy a SAN, or expensive controller.

ZFS has by far the best documentation. BTRFS's documentation has improved, but the tools are still difficult to use.