Hacker News
by throwdbaaway 1969 days ago
Fine, with enough replicas you can sleep well at night. But what about 3 years of uptime without a reboot? Can you really enjoy your morning coffee without thinking about it? :)

Netflix has used fully ephemeral storage for its Cassandra clusters since the beginning, back when that storage was still spinning disks. Years later, they still insist on doing this, and had to come up with a creative solution to fix the uptime issue: https://netflixtechblog.medium.com/datastore-flash-upgrades-...


From the parent comment:

> it takes like an hour to catch up a new replica from scratch

That means it should be pretty easy to replace an instance - just create a new replica from scratch, then fail over (if you're replacing the current master/primary instance), and remove the old one.
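To make the order of operations concrete, here is a minimal Python sketch of that replacement sequence using a toy in-memory model. `Node`, `Cluster`, and `replace_instance` are hypothetical names, not any real database's API, and the roughly one-hour catch-up is simulated by flipping a flag.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    caught_up: bool = False  # True once the replica has fully synced


@dataclass
class Cluster:
    # Hypothetical toy model: one primary plus a list of replicas.
    primary: Node
    replicas: list[Node] = field(default_factory=list)

    def add_replica(self, name: str) -> Node:
        node = Node(name)
        self.replicas.append(node)
        return node

    def catch_up(self, node: Node) -> None:
        # Stand-in for the full copy plus replication catch-up
        # (about an hour in the parent comment's setup).
        node.caught_up = True

    def failover_to(self, node: Node) -> None:
        # Promote a caught-up replica; the old primary becomes a replica.
        if not node.caught_up:
            raise RuntimeError(f"{node.name} is not caught up yet")
        self.replicas.remove(node)
        self.replicas.append(self.primary)
        self.primary = node

    def remove(self, node: Node) -> None:
        self.replicas.remove(node)


def replace_instance(cluster: Cluster, old_name: str, new_name: str) -> Cluster:
    # 1. Build a new replica from scratch and let it catch up.
    new = cluster.add_replica(new_name)
    cluster.catch_up(new)
    # 2. Fail over only if the instance being replaced is the primary.
    if cluster.primary.name == old_name:
        cluster.failover_to(new)
    # 3. The old instance is now a replica, so it can be dropped.
    old = next(n for n in cluster.replicas if n.name == old_name)
    cluster.remove(old)
    return cluster


c = Cluster(primary=Node("db-1", caught_up=True), replicas=[Node("db-2", caught_up=True)])
replace_instance(c, "db-1", "db-3")
print(c.primary.name)                # db-3
print([n.name for n in c.replicas])  # ['db-2']
```

The old node is removed only after the new one has caught up, so the number of in-sync copies never drops below where it started during the swap.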

Why can't you reboot? NVMe storage is local, not ephemeral.

Yes, you can reboot, but you have to update in place instead of rolling out a new OS image. Or you adopt the Netflix approach. There are also some additional restrictions, e.g. you can't change the machine type.

Anyway, from the other comment here, I think tommyzli might not have realized that a reboot is still possible, which would partially explain the 3-year uptime.