HN Mirror

by 0xbadcafebee 237 days ago

There's a ton of jargon here. Summarized...

Why EBS didn't work:

  - EBS bills for allocated (provisioned) capacity, not for what you actually use
  - EBS is slow at restores from snapshot (faster to spin up a database from a Postgres backup stored in S3 than from an EBS snapshot in S3)
  - EBS only lets you attach 24 volumes per instance
  - EBS only lets you resize once every 6–24 hours; you can't shrink a volume or adjust its size continuously
  - Detaching and reattaching EBS volumes can take 10s for healthy volumes to 20m for failed ones, so failover takes longer

Why all this matters:

  - their AI agents are all ephemeral snapshots; they constantly destroy and rebuild EBS volumes

What didn't work:

  - local NVMe/bare metal: need 2-3x nodes for durability, too expensive; snapshot restores are too slow
  - custom page-server Postgres storage architecture: too complex/expensive to maintain

Their solution:

  - block COWs
  - volume changes (new/snapshot/delete) are a metadata change
  - storage space is logical (effectively infinite) not bound to disk primitives
  - multi-tenant by default
  - versioned, replicated k/v transactions, horizontally scalable
  - independent service layer abstracts blocks into volumes, is the security/tenant boundary, enforces limits
  - user-space block device, pins i/o queues to cpus, supports zero-copy, resizing; depends on Linux primitives for performance limits
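To make "block COWs" and "volume changes are a metadata change" concrete, here is a toy sketch in Python. It is not the system from the blog post: `BlockStore`, `Volume`, and every method name are hypothetical, the store is a plain in-memory dict standing in for the replicated k/v layer, and a real implementation would also share the block map between snapshots instead of copying it.

```python
BLOCK_SIZE = 4096  # bytes; same 4 KB block size the benchmarks use


class BlockStore:
    """Hypothetical stand-in for the shared k/v layer: block id -> immutable bytes."""

    def __init__(self):
        self.blocks = {}
        self._next_id = 0

    def put(self, data: bytes) -> int:
        # Blocks are never overwritten in place; every write gets a fresh id.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id


class Volume:
    """A volume is only metadata: a map from logical block index to block id."""

    def __init__(self, store: BlockStore, block_map=None):
        self.store = store
        self.block_map = dict(block_map or {})

    def write(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        # Copy-on-write: point this volume at a new block, leave the old one alone.
        self.block_map[index] = self.store.put(data)

    def read(self, index: int) -> bytes:
        block_id = self.block_map.get(index)
        # Unwritten blocks read as zeros, so logical size is not tied to stored data.
        return bytes(BLOCK_SIZE) if block_id is None else self.store.blocks[block_id]

    def snapshot(self) -> "Volume":
        # Metadata-only operation: no block data is copied.
        return Volume(self.store, self.block_map)


store = BlockStore()
base = Volume(store)
base.write(0, b"a" * BLOCK_SIZE)

fork = base.snapshot()            # instant: copies the map, not the data
fork.write(0, b"b" * BLOCK_SIZE)  # diverges without touching the base volume
```

After the fork writes, the base volume still reads its original block, and the store holds two blocks in total.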

Performance stats (single volume):

  - (latency/IOPS benchmarks: 4 KB blocks; throughput benchmarks: 512 KB blocks)
  - read: 110,000 IOPS and 1.375 GB/s (bottlenecked by network bandwidth)
  - write: 40,000–67,000 IOPS and 500–700 MB/s, synchronously replicated
  - single-block read latency ~1 ms, write latency ~5 ms
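The IOPS and throughput figures come from different block sizes, so they don't multiply together directly. A quick back-of-the-envelope check in Python, assuming decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes), which the summary does not specify; the helper names are made up for this sketch:

```python
def iops_to_mb_per_s(iops: int, block_kb: int) -> float:
    """Data rate in MB/s implied by an IOPS figure at a given block size (decimal units)."""
    return iops * block_kb / 1000


def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Convert MB/s to Gbit/s (8 bits per byte, decimal units)."""
    return mb_per_s * 8 / 1000


# 110,000 read IOPS at 4 KB blocks moves far less data than the throughput test.
small_block_read = iops_to_mb_per_s(110_000, 4)   # 440.0 MB/s

# 1.375 GB/s expressed as a network rate.
read_link_rate = mb_per_s_to_gbit_per_s(1375)     # 11.0 Gbit/s

# Write side at 4 KB blocks, low and high end of the quoted range.
small_block_write = (iops_to_mb_per_s(40_000, 4), iops_to_mb_per_s(67_000, 4))
```

Under those assumptions, 1.375 GB/s works out to 11 Gbit/s on the wire, which fits the note that reads are limited by network bandwidth and not by the storage layer.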

8 comments

jread 237 days ago

I'm working on graduate research evaluating AWS control and data plane performance.

EBS volume attachment is typically ~11s for GP2/GP3 and ~20-25s for other types.

1ms read / 5ms write latencies seem high for 4k blocks. IO1/IO2 is typically ~0.5ms RW, and GP2/GP3 ~0.6ms read and ~0.94ms write.

References: https://cloudlooking.glass/matrix/#aws.ebs.us-east-1--cp--at... https://cloudlooking.glass/matrix/#aws.ebs.*--dp--rand-*&aws...

0xbadcafebee 237 days ago

You might want to add the bit from the blog about worst-case attach times to your research. From my own experience (though it was years ago), sometimes an EBS volume would fail and simply never return. That definitely won't be acceptable for some use cases.

jread 237 days ago

Yes, we've been testing volume attachments every 5m since start of the year, and have experienced 100-150 attachment failures per volume type in that time frame during multiple events (most recently last week):

https://cloudlooking.glass/dashboard/#aws.ebs.us-east-1--cp-...

Another interesting bit: last March AWS changed something in the control plane, which both triggered a multi-day LSE and ultimately increased attachment times from 2-3s to 10-20s (also visible in the graphs).

hedora 237 days ago

Thanks for the summary.

Note that those numbers are terrible vs. a physical disk, especially latency, which should be < 1ms read, << 1ms write.

(That assumes async replication of the write ahead log to a secondary. Otherwise, write latency should be ~ 1 rtt, which is still << 5ms.)

Stacking storage like this isn’t great, but PG wasn’t really designed for performance or HA. (I don’t have a better concrete solution for ANSI SQL that works today.)

mfreed 237 days ago

A few datapoints that might help frame this:

- EBS typically operates in the millisecond range. AWS' own documentation suggests "several milliseconds"; our own experience with EBS is 1-2 ms. Reads/writes to local disk alone are certainly faster, but it's more meaningful to compare this against other forms of network-attached storage.

- If durability matters, async replication isn't really the right baseline for local disk setups. Most production deployments of Postgres/databases rely on synchronous replication -- or "semi-sync," which still waits for at least one or a subset of acknowledgments before committing -- which in the cloud lands you in the single-digit millisecond range for writes again.

graveland 237 days ago

(I'm on the team that made this)

The raw numbers are one thing, but the overall performance of pg is another. If you check out https://planetscale.com/blog/benchmarking-postgres-17-vs-18 for example, in the average QPS chart, you can see that there isn't a very large difference in QPS between GP3 at 10k iops and NVMe at 300k iops.

So currently I wouldn't recommend this new storage for the highest end workloads, but it's also a beta project that's still got a lot of room for growth! I'm very enthusiastic about how far we can take this!

samlambert 237 days ago

it's a 70% difference at lower cost. i know math is hard but c'mon try and be serious.

znpy 237 days ago

Reminds me of about ten years ago when a large media customer was running NetApp on cloud to get most of what you just wrote on AWS (because EBS features sucked/sucks very bad and are also crazy expensive).

I did not set that up myself, but the colleague that worked on that told me that enabling tcp multipath for iscsi yielded significant performance gains.

_rs 237 days ago

> Detaching and reattaching EBS volumes can take 10s for healthy volumes to 20m for failed ones

Is there a source for the 20m time limit for failed EBS volumes? I experienced this at work for the first time recently but couldn't find anything documenting the 20m SLA (and it did take just about 20 full minutes).

mfreed 237 days ago

I'm not aware of any published source for this time limit, nor ways to reduce it.

The docs do say, however, "If the volume has been impaired for more than 20 minutes, you can contact the AWS Support Center." [0] which suggests it's some expected cleanup/remount interval.

That is, it is something that we regularly encounter when EC2 instances fail, so we were sharing from personal experience.

[0] https://docs.aws.amazon.com/ebs/latest/userguide/work_volume...

lisperforlife 237 days ago

The 5ms write latency and 1ms read latency sound like they are using S3 to store and retrieve data with some local cache. My guess is an S3-based block storage exposed as a network block device. S3 supports compare-and-swap operations (Put-If-Match), so you can do a copy-on-write scenario quite easily. Maybe somebody from TigerData can give a little bit more insight into this. I know slatedb supports S3 as a backend for their key-value store. We can build a block device abstraction using that.
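For readers unfamiliar with the pattern, the compare-and-swap idea in this comment can be sketched as a retry loop around a conditional put. The `ObjectStore` class below is a hypothetical in-memory stand-in, not the S3 API, and the version counter plays the role an ETag would. It only illustrates the guess made in this comment, not how the system under discussion works.

```python
import json


class PreconditionFailed(Exception):
    """Raised when the stored version no longer matches the caller's expectation."""


class ObjectStore:
    """Hypothetical in-memory object store with a put-if-match style conditional write."""

    def __init__(self):
        self._objects = {}  # key -> (version, body)

    def get(self, key):
        # Missing objects are reported as version 0 with an empty body.
        return self._objects.get(key, (0, b""))

    def put_if_match(self, key, body, expected_version):
        current_version, _ = self.get(key)
        if current_version != expected_version:
            raise PreconditionFailed(key)
        self._objects[key] = (current_version + 1, body)
        return current_version + 1


def remap_block(store, volume_key, index, block_id, max_retries=5):
    """Point one logical block of a volume at a new block id using compare-and-swap."""
    for _ in range(max_retries):
        version, body = store.get(volume_key)
        block_map = json.loads(body.decode()) if body else {}
        block_map[str(index)] = block_id
        try:
            return store.put_if_match(volume_key, json.dumps(block_map).encode(), version)
        except PreconditionFailed:
            continue  # another writer won the race; re-read and try again
    raise RuntimeError("too many concurrent writers")


store = ObjectStore()
remap_block(store, "volumes/demo", 0, "block-a")
remap_block(store, "volumes/demo", 1, "block-b")
final_version, final_body = store.get("volumes/demo")
final_map = json.loads(final_body.decode())
```

A writer holding a stale version gets `PreconditionFailed` and has to re-read before retrying, which is what keeps concurrent metadata updates from silently overwriting each other.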

mfreed 237 days ago

None of this. It's in the blog post in a lot of detail =)

The 5ms write latency is because the backend distributed block storage layer is doing synchronous replication to multiple servers for high availability and durability before ack'ing a write. (And this path has not yet been super-performance-optimized for latency, to be honest.)

bradyd 237 days ago

> EBS only lets you resize once every 6–24 hours

Is that even true? I've resized an EBS instance a few minutes after another resize before.

electroly 237 days ago

AWS documents it as "After modifying a volume, you must wait at least six hours and ensure that the volume is in the in-use or available state before you can modify the same volume" but community posts suggest you can get up to 8 resizes in the six hour window.

jasonthorsness 237 days ago

The 6-hour counter is most certainly, painfully true. If you work with an AWS rep please complain about this in every session; maybe if we all do they will reduce the counter :P.

thesz 237 days ago

What does EBS mean?

It is used in the first line of the text, but no explanation is given.

karanbhangui 237 days ago

https://aws.amazon.com/ebs/

samat 237 days ago

Excellent tl;dr! Would pay to get them for every worthwhile tech article.

akulkarni 237 days ago
