Hacker News
by psanford 854 days ago
There's a bunch of things in here that don't really make sense:

> The incident was caused by a third-party caching client library that was recently integrated into our system. This client library received unprecedented load conditions caused by devices coming back online all at once. As a result of increased demand, it mixed up device ID and user ID mapping and connected some data to incorrect accounts.

What? How does load on the system affect correctness?

> The outage originated from our partner AWS

What does this mean? Was there an AWS outage for a service they use, or was this just a normal loss of an instance?

It's interesting that they blame external entities for the root causes of the incident and don't take responsibility for what is ultimately on them.

3 comments

I assume the code was always incorrect but only exhibits the problem in practice under high load. This could be a race condition (or data race), or code that treats short hashes as unique.
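To make the short-hash guess concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `device-N` IDs, the `short_id` helper, the 2-hex-character truncation); nothing is known about the actual library. It only shows why truncated hashes look unique with a handful of devices and stop being unique once many devices show up at once.

```python
import hashlib


def short_id(device_id, hex_chars=2):
    """Hypothetical cache key: a SHA-256 digest truncated to a few hex chars."""
    return hashlib.sha256(device_id.encode()).hexdigest()[:hex_chars]


def count_collisions(n_devices, hex_chars=2):
    """Count devices whose short key is already taken by another device."""
    seen = {}
    collisions = 0
    for i in range(n_devices):
        device = f"device-{i}"  # made-up device IDs
        key = short_id(device, hex_chars)
        if key in seen:
            collisions += 1  # two different devices now share one key
        else:
            seen[key] = device
    return collisions


# Two hex chars give only 256 possible keys, so 1000 devices must collide
# at least 1000 - 256 = 744 times (pigeonhole principle).
many = count_collisions(1000)
```

With longer truncations the collisions become rare rather than impossible, which is exactly the kind of bug that only surfaces when every device reconnects at the same time.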
It’s just a WAG, but I bet someone used a timestamp as a unique key, or at least part of one, so you were unlikely to get collisions except under load.
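A rough Python sketch of the timestamp-as-key guess, under stated assumptions: the millisecond resolution, the `assign_sessions` helper, and the user names are all invented for illustration and do not come from the incident report.

```python
def assign_sessions(arrivals_ms):
    """Store each user under a key derived only from the arrival time.

    arrivals_ms: list of (user_id, arrival_time_in_ms) pairs (made-up data).
    Returns the cache and the users whose entry was silently overwritten.
    """
    cache = {}
    clobbered = []
    for user_id, t_ms in arrivals_ms:
        key = t_ms  # BUG: assumes no two requests share a millisecond
        if key in cache:
            clobbered.append(cache[key])  # earlier user's entry is lost
        cache[key] = user_id
    return cache, clobbered


# Light load: one request every 10 ms, so every key is distinct.
light = [(f"user{i}", 1000 + 10 * i) for i in range(5)]
light_cache, light_clobbered = assign_sessions(light)

# Burst: five requests land within two milliseconds.
burst = [(f"user{i}", 1000 + i // 3) for i in range(5)]
burst_cache, burst_clobbered = assign_sessions(burst)
# burst_cache is {1000: "user2", 1001: "user4"}; user0, user1 and user3 were overwritten.
```

The code path is identical in both runs; only the arrival rate changes, which is how load ends up affecting correctness.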
> What? How does load on the system affect correctness?

I've seen this happen quite often with code that is not thread-safe, especially in languages like C# and Java: for example, using a static class property for data that should be request-scoped, or not using the appropriate concurrent collection classes.
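The static-property pattern can be sketched in Python as well (used here only for brevity; the same shape applies to a C# or Java static field). The handler classes, the `run` helper, and the barrier that forces two requests to overlap are all assumptions made for the demonstration.

```python
import threading


class SharedHandler:
    current_user = None  # class-level state shared by every request (the bug)

    @classmethod
    def handle(cls, user, barrier):
        cls.current_user = user
        barrier.wait()  # stand-in for slow I/O; forces the requests to overlap
        return cls.current_user  # may now belong to another request


class ScopedHandler:
    _local = threading.local()  # one independent slot per thread

    @classmethod
    def handle(cls, user, barrier):
        cls._local.user = user
        barrier.wait()
        return cls._local.user


def run(handler, users):
    """Run one request per user concurrently; return {user: user the handler saw}."""
    barrier = threading.Barrier(len(users))
    results = {}

    def worker(user):
        results[user] = handler.handle(user, barrier)

    threads = [threading.Thread(target=worker, args=(u,)) for u in users]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


shared = run(SharedHandler, ["alice", "bob"])  # both requests see the same user
scoped = run(ScopedHandler, ["alice", "bob"])  # each request sees its own user
```

In the shared version one of the two requests always gets the other's identity, because both writes complete before either read. Without the barrier the overlap would only happen occasionally, and more often as concurrency rises.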