Yes, it’s the ephemerality that’s the biggest issue. Enterprise-grade SSDs are quite reliable, and typically have PLP (power-loss protection), so even in the event of a sudden power loss, any queued writes that the drive has accepted - and thus ack’d the fsync() for - will be written. Presumably you’d be running some kind of redundancy, likely some flavor of RAID or RAID-Z (assuming purely local storage here, not a distributed system like Ceph, nor synchronous replication).
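To make the fsync() point concrete, here is a minimal sketch of what "the drive has accepted the write" looks like from the application side. The helper name `durable_append` and the file name are made up for illustration; real databases do considerably more (for example, also fsyncing the parent directory when a file is newly created).

```python
import os

def durable_append(path: str, data: bytes) -> None:
    """Append data and return only after the OS reports it flushed to the device.

    Hypothetical helper for illustration. Whether the bytes survive a power cut
    after fsync() returns still depends on the drive honoring the flush,
    which is where PLP comes in.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        offset = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while offset < len(data):
            offset += os.write(fd, data[offset:])
        # Ask the kernel to push the file's data down to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Usage would be something like `durable_append("wal.log", b"commit 42\n")`. The write is only as durable as the device behind it, which is exactly why an instance-local drive that disappears with the instance is the weak point.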
But in the cloud, if the physical server backing your instance dies, or even if someone accidentally issues a shutdown command, you don’t get that same drive back when the new instance comes up. So a problem that is normally solved by basic local redundancy suddenly becomes impossible to solve that way, and you must either run synchronous or semi-sync replication (the latter is what PlanetScale Metal does) and accept the latency hit that comes with distributed storage, or run asynchronous replication and accept some amount of data loss, which is rarely acceptable.
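A toy model of that trade-off, as a sketch only: it assumes the worst case, where the primary (and its local drive) disappears before any asynchronous backlog is shipped. The function name and the `required_acks` parameter are invented for illustration and do not correspond to any particular database's settings.

```python
def lost_after_primary_failure(required_acks, records, n_replicas=2):
    """Return the client-acknowledged records that no replica holds.

    required_acks: how many replicas the primary waits for before acking
    the client (0 = async, 1 = semi-sync-like, n_replicas = fully sync).
    """
    replicas = [[] for _ in range(n_replicas)]
    acknowledged = []
    for rec in records:
        # Replicas the primary waits on before acknowledging the client.
        for log in replicas[:required_acks]:
            log.append(rec)
        acknowledged.append(rec)
        # The remaining replicas would get rec later; in this worst case
        # the primary dies first, so that backlog never arrives.
    survivors = set().union(*(set(log) for log in replicas))
    return [rec for rec in acknowledged if rec not in survivors]

print(lost_after_primary_failure(0, ["a", "b"]))  # async: ['a', 'b']
print(lost_after_primary_failure(1, ["a", "b"]))  # waits for one replica: []
```

The price of `required_acks >= 1` is that every commit waits on at least one network round trip, which is the latency hit mentioned above.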
... sounds like a trivial job for bare metal instances
And the fact that EC2's local NVMe encryption keys are ephemeral is nice against leaks, but it's not a necessity for other clouds (and not great for resumability, which can really hurt business continuity scores). For all the money they ask for it, I'd expect them to be able to keep it relatively secure even across reboots.
Databases like Postgres have well established ways to handle that. And if you're setting up the DB yourself, you absolutely need to do backups anyway. And a replica on a different server.
On some providers (e.g. Hetzner), the dedicated servers come by default with two disks in RAID 1, so losing data to a disk failure is a lot less likely (unless the datacenter burns down).
I would never ever trust OVH with any important data or servers. We saw how they secured their datacenters when it took 3h to cut the power while the datacenter was burning.
Yes, a single disk in a VPS or cloud provider has durability concerns. That's why EBS and products like it that pretend to be a single disk are actually several. Instead of relying on multiple block devices, though, we create that redundancy at a higher level by relying on multiple MySQL or Postgres servers for durability, each with a local NVMe drive for performance.
Sure. To an extent. And if you run some mission-critical application, definitely.
But most applications run fine from local storage and can tolerate some downtime. They might even benefit from the improved performance. You can also fix the durability and disaster recovery concerns by setting up on RAID/ZFS and maintaining proper backups.
Yeah, PlanetScale loves to flex how fast they are, but the main reason they're fast is that they run one full layer of abstraction fewer than any other cloud provider, and this does in fact have trade-offs.
What is wrong with running without lots of abstractions? We are clear about the downsides. The results are clear, you can see the customers love it. We run insane amounts of state safely on ephemeral compute. It's a flex. All I've seen from Timescale people is qqing. Write some code or be quiet.
I'm not criticizing your engineering approach at all. Running everything in one box has its merits, as your benchmarks show, but it's also just not apples to apples. There are other trade-offs, and I'm just appreciating that the community calls that out.
Also, hey, this is HN, not Twitter. I think we can be a bit more civilized. Not a good look imo for a CEO to get that upset over a harmless comment.
RAID isn't the answer, either, for the record. In AWS and GCP, the CPU or RAM blowing up will cost you access to that local NVMe drive, too, no matter how much RAID you throw at it.