Hacker News new | ask | show | jobs
by tommyzli 1969 days ago
We are living life on the edge to an extent, but we have 5 hot standbys across AZs and regular backups + WAL archives to S3.

May not be as durable as EBS, but it's enough for me to sleep soundly at night. And with a highly concurrent WAL-G download, it takes like an hour to catch up a new replica from scratch.

1 comments

Fine, with enough replicas, you can sleep well at night. But how about the 3 years uptime without reboot? Can you really enjoy your morning coffee without thinking about it? :)

Netflix went full ephemeral storage for their Cassandra clusters since the beginning, at the time when they were just spinning disks. Years later, they still insist on doing this, and had to come up with creative solution to fix the uptime issue: https://netflixtechblog.medium.com/datastore-flash-upgrades-...

From the parent comment:

> it takes like an hour to catch up a new replica from scratch

That means it should be pretty easy to replace an instance - just create a new replica from scratch, then fail over (if you're replacing the current master/primary instance), and remove the old one.

Why can't you reboot? NVMe storage is local, not ephemeral.
Yes, you can reboot, but you have to update in-place, instead of rolling out a new OS image. Or you adopt the Netflix approach. There are also some additional restrictions, e.g. you can't change the machine type.

Anyway, from the other comment here, I think tommyzli might not have realized that a reboot is still possible, which would partially explain the 3 years uptime.