Hacker News new | ask | show | jobs
by sgarland 233 days ago
Yes, it’s the ephemerality that’s the biggest issue. Enterprise-grade SSDs are quite reliable, and typically have PLP so even in the event of a sudden power loss, any queued writes that the drive has accepted - and thus ack’d the fsync() - will be written. Presumably you’d be running some kind of redundancy, likely some flavor of RAID or zRAID (assuming purely local storage here, not a distributed system like Ceph, nor synchronous replication).

But in the cloud, if the physical server backing your instance dies, or even if someone accidentally issues a shutdown command, you don’t get that same drive back when the new instance comes up. So a problem that is normally solved by basic local redundancy suddenly becomes impossible, and thus you must either run synchronous or semi-sync replication (the latter is what PlanetScale Metal does), accepting the latency hit from distributed storage, or asynchronous replication and accept some amount of data loss, which is rarely acceptable.

2 comments

Agreed on these trade offs. We do both synchronous and semi-synchronous depending on Postgres or MySQL.
... sounds like a trivial job for bare metal instances

and that EC2 local NVMe encryption keys are ephemeral is nice against leaks, but not a necessity for other clouds (and not great for resumability, which can really downgrade business continuity scores), and I expect for all the money they ask for it, to be able to keep it relatively secure even across reboots

Or even a bare metal simple server that just does databases with redundant nvme ssd