Hacker News
by tgsovlerkhgsel 854 days ago
Do you think the issue was something else? "People randomly see other people's content" is an issue that would immediately make me think some issue with caching is the culprit.

Given their openness in the rest of the communications, I don't see why they would make this part up.

Edit: Of course, I'm also curious what the actual bug was. A discussion below suggests several plausible ways a problem could happen under load (e.g. concurrency issues, insufficient entropy in some key), although many of these would also cause the problem under lighter load, just much less often.


> Do you think the issue was something else?

No, I'm not questioning whether or not it was a caching issue. I'm taking exception to the lack of accountability. They chose the library. They (probably) chose to ignore a documented or common failure mode of caching systems through either poor choice of key or lack of synchronization. They've obviously designed their infrastructure in a way that isn't resilient to its current level of usage (cold start is a normal part of software's lifecycle).

They could have chosen to own that, but instead they blamed everyone else. That's not a sign of a trustworthy service provider.
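
To make the "poor choice of key" failure mode mentioned above concrete, here is a minimal sketch. Everything in it is hypothetical (the `fetch_feed` backend, the user and device names, the key layout); it is not a claim about what the affected service actually did, only an illustration of how a cache key that omits the requester's identity hands one user's data to another.

```python
# Hypothetical sketch: none of these names come from the real service.

def fetch_feed(device_id: str) -> str:
    """Stand-in for an expensive backend call."""
    return f"feed for {device_id}"

def get_feed_weak_key(cache: dict, user_id: str, device_id: str) -> str:
    # Weak key: it ignores who is asking and which device is requested,
    # so the first caller's result is served to everyone afterwards.
    key = "feed"
    if key not in cache:
        cache[key] = fetch_feed(device_id)
    return cache[key]

def get_feed_scoped_key(cache: dict, user_id: str, device_id: str) -> str:
    # Scoped key: user and device are both part of the key,
    # so entries cannot be shared across users.
    key = ("feed", user_id, device_id)
    if key not in cache:
        cache[key] = fetch_feed(device_id)
    return cache[key]

weak_cache: dict = {}
alice_weak = get_feed_weak_key(weak_cache, "alice", "cam-A")
bob_weak = get_feed_weak_key(weak_cache, "bob", "cam-B")  # gets Alice's feed

scoped_cache: dict = {}
alice_scoped = get_feed_scoped_key(scoped_cache, "alice", "cam-A")
bob_scoped = get_feed_scoped_key(scoped_cache, "bob", "cam-B")
```

With the weak key, `bob_weak` ends up holding the feed for `cam-A`; with the scoped key each user gets their own entry. Note that this kind of bug does not need high load to trigger, only a warm cache.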

It's not even that: the quoted language doesn't even blame the library - it appears to blame increased load.

"As a result of increased demand, it mixed up device ID" - no, it mixed up IDs as a result of some sort of a concurrency bug. I don't understand the point of deflecting this far.

Likely to be a multi-threading issue; my bet is the cache client wasn't thread-safe. I've seen this in some apps before and the solution was to turn off multi-threading while we debug the library that was causing the issue.
This is the answer!
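
For readers wondering what "the cache client wasn't thread-safe" could look like in practice, here is a small sketch. The `UnsafeCacheClient` class, the `STORE` contents, and the `before_read` hook are all invented for illustration and do not describe any real library; the hook exists only to force the unlucky interleaving deterministically instead of waiting for it to happen under load.

```python
import threading

# Hypothetical backing store; keys and values are made up.
STORE = {"device-alice": "alice's video", "device-bob": "bob's video"}

class UnsafeCacheClient:
    """Keeps per-request state on the shared instance: that is the bug."""
    def __init__(self):
        self._pending_key = None

    def get(self, key, before_read=lambda: None):
        self._pending_key = key          # shared across all threads
        before_read()                    # window where another thread can run
        return STORE[self._pending_key]  # may now be someone else's key

class LockedCacheClient(UnsafeCacheClient):
    """Same client, but each request is serialized behind a lock."""
    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def get(self, key, before_read=lambda: None):
        with self._lock:
            return super().get(key, before_read)

def demo(client):
    alice_started, bob_done = threading.Event(), threading.Event()
    results = {}

    def alice():
        def hook():
            alice_started.set()
            bob_done.wait(timeout=0.3)   # give Bob a chance to interleave
        results["alice"] = client.get("device-alice", hook)

    def bob():
        alice_started.wait()
        results["bob"] = client.get("device-bob")
        bob_done.set()

    threads = [threading.Thread(target=alice), threading.Thread(target=bob)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

unsafe_results = demo(UnsafeCacheClient())
locked_results = demo(LockedCacheClient())
```

In the unsafe run, Alice's request returns Bob's video because Bob overwrote the shared key between Alice's write and read. The lock is the blunt fix (roughly the "turn off multi-threading" workaround described above); keeping the key in a local variable instead of on the instance would remove the shared state altogether.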
> I'm taking exception to the lack of accountability.

I'll bet money that their statement was run through legal and stripped of all possible blamey statements.

> They could have chosen to own that, but instead they blamed everyone else. That's not a sign of a trustworthy service provider.

I agree. Companies need to own up to their fuckups, even when legal tells them that it can hurt. Because all companies will fuck up; how they handle it is the differentiator.

> I'm also curious what the actual bug was

Hardware. Rowhammer-type effects occurring accidentally under sudden load spikes. The hardware has just got too dense.

(I should clarify this is speculation, but reading the recent article included here on sudo using special maximum-distance bitfields to hold state internally (https://news.ycombinator.com/item?id=39165342)... it must be a problem that's being observed in the wild)

I can't imagine that happening with sufficient frequency. A system making such mistakes so often would just be too unstable to keep an uptime >1h.
With the 'cattle not pets' mindset that pervades modern development, is the lifespan of ephemeral cache VMs monitored that closely? They get spun up and down on demand in most architectures. I can see this being an edge-case failure when the system is trying to scale up: the existing VMs are getting absolutely hammered, the hypervisor is trying to start up new ones, memory pressure and IOPS on the existing ones are maxed out...

It just seems like the most obvious root cause to me. A single bit flip in a hashed value is going to give you the wrong result data without any other error, because the hash value is already essentially heavily compressed. Meanwhile, the hash table is almost certain to be stored entirely in memory and very heavily accessed from multiple directions in a read/write manner.

No. That's not how that works.