|
> That’s a bit error rate of 1 in 10^15 requests. In the real world, we see that blade of grass get missed pretty frequently – and it’s actually something we need to account for in S3. One of the things I remember from my time at AWS was conversations about how 1 in a billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally mark off as so wildly improbable it's not worth worrying about, have to be considered, and handled. Glad to read about ShardStore, and especially the formal verification, property based testing etc. The previous generation of services were notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about). |
Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.