by Twirrim 1100 days ago

> That’s a bit error rate of 1 in 10^15 requests. In the real world, we see that blade of grass get missed pretty frequently – and it’s actually something we need to account for in S3.

One of the things I remember from my time at AWS was conversations about how 1 in a billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally mark off as so wildly improbable it's not worth worrying about, have to be considered, and handled.

Glad to read about ShardStore, and especially the formal verification, property based testing etc. The previous generation of services were notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about).

9 comments

mjb 1100 days ago

> daily occurrence when you're operating at S3 scale

Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...

In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.
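The arithmetic behind "1 in a billion happens every ten seconds" can be sketched in a few lines. This is only an illustration, assuming every request independently has the same chance of hitting the rare event; the function names are made up for this example.

```python
SECONDS_PER_DAY = 86_400


def events_per_second(requests_per_second, probability):
    """Expected number of rare events per second at a given request rate."""
    return requests_per_second * probability


def seconds_between_events(requests_per_second, probability):
    """Mean gap between rare events, in seconds."""
    return 1.0 / events_per_second(requests_per_second, probability)


def rate_for_daily_event(probability):
    """Request rate at which the event is expected about once per day."""
    return 1.0 / (probability * SECONDS_PER_DAY)


# Figure quoted above: ~100M requests/second and a 1-in-a-billion event.
print(seconds_between_events(100e6, 1e-9))  # about 10 seconds

# Rate at which "1 in a billion" becomes a daily occurrence.
print(rate_for_daily_event(1e-9))  # about 11,574 requests/second
```

The second number suggests why "daily occurrence" is already true at request rates far below the 100M per second quoted here.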

ignoramous 1100 days ago

James Hamilton, AWS' chief architect, wrote about this phenomenon in 2017: At scale, rare events aren't rare; https://news.ycombinator.com/item?id=14038044

jdwithit 1100 days ago

James' posts are always a treat. It's so rare to encounter such plain, straightforward content from someone with a title and responsibilities like his. Without layers of marketing sugar over everything. Dude just wants to post about the cool shit he did on his GeoCities-tier website and I love it.

aborsy 1100 days ago

This phenomenon is just multiplication of the sample size (scale) times a probability (rare).

maweki 1100 days ago

It shows that, however improbable, people do win the lottery.

It's good to be reminded of that if you've been trained for years not to play the lottery because you personally won't ever win.

In this case, the Cloud vendor is the lottery organizer and they indeed need to plan for people winning.

da39a3ee 1100 days ago

I agree with what I think is your sentiment -- that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!

Twirrim 1098 days ago

> that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!

I don't mean to imply it's a profound insight, and the discussions I had in AWS were never in those terms. It's just that when you're designing and building things that are going to operate at that scale, you have to very seriously consider the improbable.

What's more difficult is actually knowing what needs to be considered. e.g. prior to working at AWS, I don't think I'd have even considered "NIC corrupts packet, in such a way it gets to the OS mangled" as something that would be worth handling. Yet S3 and similar scale services see that and other improbable events so regularly that they actually have to consciously design for it, everywhere.

It's also one reason why larger services end up being incredibly conservative about the use of technology. You know what the failure modes are, however improbable, and can account for them. New technology tends to be kept on the fringes, and only adopted in more significant places once proven and improbable failures become understood.

da39a3ee 1096 days ago

Thanks! That was interesting and helpful.

Spooky23 1099 days ago

Well it is - nobody maintains the level of detail required to actually know about these sorts of events.

I worked on a safety critical system where we’d find all sorts of unusual bugs… because we were looking for them. It really narrowed the scope for product selection, many vendors were just disqualified.

PaulRobinson 1100 days ago

Was an SDM of a team of brand new SDEs standing up a new service. In a code review, pointed to an issue that could cause a Sev2, and the SDE pushed back "that's like one in a million chance, at most". Pointed out once we were dialled up to 500k TPS (which is where we needed to be at), that was 30 times a minute... "You want to be on call that week?". Insist on Highest Standards takes on a different meaning in that stack compared to most orgs.
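The "one in a million at 500k TPS" exchange can be checked with a rough sketch like the one below. It assumes requests are independent and that "one in a million" is a per-request probability; both are assumptions made for illustration, and the helper names are hypothetical.

```python
import math


def expected_events(tps, probability, seconds):
    """Expected number of occurrences over a time window."""
    return tps * probability * seconds


def prob_at_least_one(tps, probability, seconds):
    """Chance of seeing the issue at least once in the window.

    Computes 1 - (1 - p)^n using log1p/expm1 to stay accurate for tiny p.
    """
    n = tps * seconds
    return -math.expm1(n * math.log1p(-probability))


print(expected_events(500_000, 1e-6, 60))    # about 30 per minute
print(prob_at_least_one(500_000, 1e-6, 1))   # about 0.39 in any one second
print(prob_at_least_one(500_000, 1e-6, 60))  # effectively 1 within a minute
```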

rubiquity 1100 days ago

Daily? A component I worked on that supported S3’s Index could hit a 1 in a billion issue multiple times a minute. Thankfully we had good algorithms and hardware that is a lot more reliable these days!

Twirrim 1100 days ago

This was 7-8 years ago now. Lot of scaling up since those days :)

rubiquity 1100 days ago

I’m sure my numbers are out of date now too

rkagerer 1100 days ago

Personally I'd love working in that kind of environment. That one in a billion hole still itches at me. There's also a slightly-perverse little voice in my head ready with popcorn in case I'm lucky enough to watch the ensuing fallout from the first major crypto hash collision :-).

fooker 1100 days ago

That probability is significantly lower than one in a billion.

One in a billion would be if keys were ~30 bits. Luckily they aren't.

rkagerer 1099 days ago

The one in a billion was in reference to storage related stats described in the article. Not private crypto keys.

delecti 1100 days ago

I love conversations like this that remind me how unintuitive big numbers are.

ldjkfkdsjnv 1100 days ago

Also worked at Amazon, saw some issues with major well known open source libraries that broke in places nobody would ever expect.

wrboyce 1100 days ago

Any examples you can share?

ruckfool 1100 days ago

Redis Node failover

ldjkfkdsjnv 1100 days ago

Apache tomcat starts to break down

thewakalix 1100 days ago

Could you elaborate?

baz00 1100 days ago

We get this on a much lower scale. We have to maintain many forks because no one is responsive on taking patches.

ilyt 1100 days ago

I think Ceph hit similar problems and had to add more robust checksumming to the system, since relying on just TCP checksums for integrity, for example, was no longer enough.
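One well-known reason a TCP-style checksum is a weak integrity check: it is a 16-bit ones' complement sum, and addition does not care about order, so swapping two 16-bit words leaves it unchanged. The sketch below follows the style of RFC 1071 and compares it with CRC-32 from the standard library. It only illustrates the general weakness; it is not a description of what Ceph or S3 actually do.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"AABBCCDD"
reordered = b"CCBBAADD"  # first and third 16-bit words swapped

# The additive checksum cannot tell the two payloads apart.
print(internet_checksum(original) == internet_checksum(reordered))  # True

# CRC-32 is position-sensitive and does notice the reordering.
print(zlib.crc32(original) != zlib.crc32(reordered))  # True
```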

benou 1100 days ago

Not that surprising, given this was already extensively documented in the 2000's (so already widely known by then) with iSCSI and such, see https://www.rfc-editor.org/rfc/rfc3385 for example.

Twirrim 1100 days ago

Yes, I remember tcp checksumming coming up as not sufficient at one stage. Even saw S3 deal with a real head-scratcher of a non-impacting event that came down to a single NIC in a single machine corrupting the tcp checksum under very specific circumstances.

jamesblonde 1100 days ago

HDFS never relied on only network checksums. Blocks should be checksummed and validated at clients - a reliable end-to-end guarantee.
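A minimal sketch of the end-to-end idea: keep a digest next to each block when it is written and re-verify it on every read, regardless of what the network layer already checked. The in-memory `store`, the function names, and the choice of SHA-256 are all assumptions for illustration, not what HDFS or S3 use.

```python
import hashlib


class CorruptBlockError(Exception):
    """Raised when a block no longer matches the digest recorded at write time."""


store = {}  # hypothetical stand-in for remote storage: key -> (data, digest)


def put_block(key, data):
    """Record the block together with a digest computed by the writer."""
    store[key] = (data, hashlib.sha256(data).hexdigest())


def get_block(key):
    """Return the block only if it still matches its recorded digest."""
    data, expected = store[key]
    if hashlib.sha256(data).hexdigest() != expected:
        raise CorruptBlockError(key)
    return data


put_block("block-1", b"hello world")
print(get_block("block-1"))  # b'hello world'

# Simulate a single flipped bit somewhere between write and read.
data, digest = store["block-1"]
store["block-1"] = (bytes([data[0] ^ 0x01]) + data[1:], digest)

try:
    get_block("block-1")
except CorruptBlockError:
    print("corruption detected")
```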

Twirrim 1100 days ago

Well... yeah. S3 has checksums and all sorts of fixity checks right throughout. At no stage do they ever rely on a single mechanism. If there's one thing they're insanely paranoid about, it's data correctness and durability.

It has been several years, so I really don't remember much about the tcp checksum / corrupting NIC thing. Typically tcp checksum failures are handled entirely by the NIC, you wouldn't even notice it. My vague recollection was it coming up between two services not in the customer synchronous path (so e.g. not involved in getting data to or from the customer), and it caused something on the OS side.

I do remember that there was a contingent of engineers that were convinced it was a cosmic ray bit flip, which seems to be a thing certain types of engineers end up doing when presented with improbable-seeming circumstances. It wasn't until it had happened a second or third time (weeks later) that they realised the origin machine was the same each time, and were able to dig in deeper to the point of reproduction.

jacobgorm 1099 days ago

To think that when Andy’s Coho Data built their first prototype on top of my abandoned Lithium [1] code base from VMware, the first thing they did was remove “all the crazy checksumming code” to not slow things down…

[1] https://dl.acm.org/doi/10.1145/1807128.1807134

Waterluvian 1100 days ago

Ever see a UUID collision?

cmckn 1100 days ago

Eh, UUIDs are usually not truly global anyway; so you’d need a collision in the context of a single region, cell, user, resource, etc. for it to matter.

mabbo 1100 days ago

Even at a billion requests per second, 128 bit UUIDs shouldn't collide for something like a billion years.

And that's if you're going completely random and not taking care to try to reduce collisions.

Dylan16807 1100 days ago

Are you sure about that math?

A billion seconds at a billion requests per second is already 2^60 items. You'd only need a few billion seconds to have a 50:50 collision chance with 128 random bits, and even less with a real UUID that only has 122 random bits.

You'd hit 1% odds of collision after less than a decade.

If you actually want to go for a billion years, you need to expand that UUID by 50%.
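The estimates in this subthread can be checked with the standard birthday-bound approximation: for a space of N values, the number of random items needed to reach collision probability p is roughly sqrt(2 · N · ln(1/(1 − p))). The sketch below assumes perfectly uniform random IDs and a constant rate of a billion per second; the function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 86_400


def items_for_collision_probability(random_bits, probability):
    """Approximate item count at which a collision has the given probability."""
    space = 2 ** random_bits
    return math.sqrt(2 * space * math.log(1 / (1 - probability)))


def years_at_rate(items, items_per_second):
    """How long it takes to generate that many items at a steady rate."""
    return items / items_per_second / SECONDS_PER_YEAR


rate = 1e9  # a billion IDs per second, as in the comments above

for bits in (122, 128):
    for p in (0.01, 0.5):
        n = items_for_collision_probability(bits, p)
        print(f"{bits} bits, p={p}: {n:.2e} items, "
              f"{years_at_rate(n, rate):.1f} years")
```

Under these assumptions the answers come out in decades to centuries, not billions of years, and the gap between 122 and 128 random bits is a factor of eight in time.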

mabbo 1100 days ago

You know I think I converted powers of two and powers of ten interchangeably in my calculations. You're very likely correct.

danielmarkbruce 1100 days ago

This seems off. A few billion seconds to have a 50:50 chance? Why wouldn't a billion seconds at a billion per second (2^60 total requests) give a 1 in 2^68 chance (or 1 in 2^62 if it's really only 122 bits)?

Dylan16807 1100 days ago

Birthday paradox. The number of opportunities to collide is the number of items squared. (Divided by two and a smidge)
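The "items squared, divided by two and a smidge" point can be seen on the classic small case of 365 birthdays. The sketch below computes the exact collision probability and compares it with the pair-count approximation 1 − exp(−pairs / space); the function names are made up for this example.

```python
import math


def collision_probability(num_items, space_size):
    """Exact chance that at least two of num_items uniform picks coincide."""
    p_no_collision = 1.0
    for i in range(num_items):
        p_no_collision *= (space_size - i) / space_size
    return 1.0 - p_no_collision


def pair_count(num_items):
    """Number of distinct pairs, i.e. opportunities to collide."""
    return num_items * (num_items - 1) // 2


def approx_collision_probability(num_items, space_size):
    """Approximation driven purely by the number of pairs."""
    return -math.expm1(-pair_count(num_items) / space_size)


print(pair_count(23))                                  # 253
print(round(collision_probability(23, 365), 3))        # 0.507
print(round(approx_collision_probability(23, 365), 3)) # 0.5
```

Only 23 items produce 253 pairs, which is why a space of 365 values is already at even odds of a collision.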

danielmarkbruce 1100 days ago

Lol. I must be brain dead. Yes.

penteract 1100 days ago

Because we're talking about collisions, as opposed to comparing 2^64 independent pairs. With 2^128 possible values, if you've picked 2^63 distinct ones, the chance that a randomly selected value collides with one of those is 1 in 2^65. If none of your second batch of 2^63 collide with each other, that gives a 2^63/2^65 = 1/4 chance of one of them colliding with the first batch. Considering the possibility of collisions within each batch of 2^63 brings it closer to 1 in 2.

jandrewrogers 1100 days ago

There have been many cases of UUIDv4 collisions because an RNG wasn’t as random as expected, due to a broken RNG or developer error. It is one of those cases where practice is not as reliable as theory, and it is banned in some places as a consequence.

It depends on how paranoid you need to be.

MichaelZuo 1099 days ago

NIST standards on RNG are not as random as expected?

Or do you mean certain folks intentionally chose substandard implementations for some reason?

jandrewrogers 1096 days ago

A significant number of implementers roll their own UUIDv4. It seems so easy so why not? Most UUIDs are used in contexts where the devs are not that sophisticated so it isn’t that surprising that naive mistakes happen. If you are using it for distributed UUID generation, it just takes one person making a mistake to create havoc.

UUIDv4 is banned in many high security environments primarily because it is easy for people to screw up in practice and it is difficult to detect when those mistakes are made. 128-bits doesn’t leave much room for mistakes using probabilistic uniqueness.
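One way a hand-rolled UUIDv4 can go wrong is sketched below: two workers build IDs from a general-purpose PRNG and happen to start from the same seed, so they emit identical "random" UUIDs. The `homemade_uuid4` helper and the seed value are hypothetical, chosen only to illustrate the kind of mistake described above.

```python
import random
import uuid


def homemade_uuid4(rng: random.Random) -> uuid.UUID:
    """Hypothetical hand-rolled UUIDv4 built on a non-cryptographic PRNG."""
    # uuid.UUID sets the version and variant bits when version=4 is passed.
    return uuid.UUID(int=rng.getrandbits(128), version=4)


# Developer error: both workers seed their generator with the same value.
worker_a = random.Random(1234)
worker_b = random.Random(1234)

ids_a = [homemade_uuid4(worker_a) for _ in range(3)]
ids_b = [homemade_uuid4(worker_b) for _ in range(3)]

print(ids_a == ids_b)  # True: every ID collides across the two workers

# The standard library version draws from the OS entropy source instead.
print(uuid.uuid4() != uuid.uuid4())  # True
```

Each ID still looks like a valid version 4 UUID, which is part of why this kind of mistake is hard to detect after the fact.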

polynomial 1100 days ago

Facts.

lazide 1100 days ago

Shouldn’t != never happens. All sorts of weird implementation issues can cause problems.