by schrep 5744 days ago
We actually encounter Thundering Herd problems on a very regular basis. The Starbucks page has nearly 14M fans and posts may get tens of thousands of comments/likes with a high update rate. You have a lot of readers on a frequently changed value which means it is not often current in cache and you can have a pileup on the database.

Since we encounter this on a regular basis we have built a few different systems to gracefully handle them.
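
One common way to handle this kind of pileup is to let a single reader per key go to the database while everyone else waits for its result. The sketch below is only an illustration of that idea, not a description of the systems mentioned above; `CoalescingCache`, `slow_db_read`, `demo`, and the key name are all hypothetical.

```python
import threading
import time


class CoalescingCache:
    """Toy in-process cache: at most one caller per key loads from the backing store."""

    def __init__(self, loader):
        self._loader = loader      # hypothetical function key -> value (stands in for the db)
        self._values = {}          # key -> cached value
        self._inflight = {}        # key -> Event signalling that a load has finished
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    # Nobody is loading this key yet, so this caller does it.
                    event = threading.Event()
                    self._inflight[key] = event
            if leader:
                try:
                    value = self._loader(key)
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    with self._lock:
                        del self._inflight[key]
                    event.set()    # wake the waiters, even if the load failed
            else:
                # Wait for the leader, then loop to read the cached value
                # (or take over as leader if the load raised).
                event.wait()


def demo(readers=50):
    """Start many concurrent readers on one key and count the database reads."""
    calls = []

    def slow_db_read(key):         # hypothetical stand-in for a slow query
        calls.append(key)
        time.sleep(0.05)
        return f"value-for-{key}"

    cache = CoalescingCache(slow_db_read)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(cache.get("page:likes")))
        for _ in range(readers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

Calling `demo()` starts 50 readers on the same missing key, and only one of them should reach `slow_db_read`. Note that this only helps when the loaded value is accepted as valid; it does nothing for a value that every reader rejects and re-fetches.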

Unfortunately, the event today was not just a thundering herd because the value never converged. All clients who fetched the value from a db thought it was invalid and forced a re-fetch.

2 comments

Would monitoring the rate of invalidation and triggering an event handler help?
The second strangest part of this outage (to me) was that the cluster "was quickly overwhelmed by hundreds of thousands of queries a second." Does Facebook not have a way to curtail the number of queries being sent to its databases? I'm not a highly experienced programmer or DBA, but I've seen mod_perl websites that do this in their API layer. It was engineered in once it was realized that the database cluster could only handle so many queries and connections at a time, and they didn't want to lock up the database servers.
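
To make the idea concrete, here is a minimal sketch of an API-layer cap on concurrent database queries. It is an assumption about how such a limit could look, not how any of the sites mentioned actually do it; `QueryLimiter`, `Overloaded`, and the limit value are made up for illustration.

```python
import threading


class Overloaded(Exception):
    """Raised when the limiter refuses to send another query to the database."""


class QueryLimiter:
    """Allow at most max_concurrent queries in flight; reject the rest immediately."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, query_fn, *args):
        # Non-blocking acquire: fail fast instead of queueing up behind the db.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many queries in flight, try again later")
        try:
            return query_fn(*args)
        finally:
            self._slots.release()
```

A caller that gets `Overloaded` can serve a stale cached value or an error page instead of adding to the load. Rejecting rather than blocking is a design choice here: blocked requests still tie up frontend workers.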

The strangest thing to me was that the clients had permission to essentially cause a race condition across the whole site. From what I understand, the client ran an API call that did one of two things. Either it forced a re-fetch of a key which apparently only needed to be fetched once (and which could therefore, in theory, have been staged in advance by a database application not running on the frontend site, which would update the cache and prevent a database fetch, or just update the database, whichever was more necessary). Or, when the database connection failed because of the aforementioned thundering herd of QPS, it also triggered a re-fetch from the db (which again could have been prevented by a db app pre-loading the new value). So, if my outsider's idea of how Facebook's code works is accurate, this could have been prevented if the cache/database had been "pre-fetched" in the background without using client API calls, or if client API calls simply weren't all allowed to modify [and read] a single key in the database. The latter seems less likely than the former, but possible.
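
A rough sketch of the "pre-fetch in the background" idea, under my outsider's assumptions: a separate job writes the new value to both the database and the cache, so client reads never miss and never touch the db. The `Store` class and its method names are hypothetical, and plain dicts stand in for the real database and cache.

```python
class Store:
    """Toy model of a database with a cache in front of it."""

    def __init__(self):
        self.db = {}         # stands in for the database
        self.cache = {}      # stands in for the cache tier
        self.db_reads = 0    # how many client reads fell through to the db

    def client_read(self, key):
        # Client path: read-only, fills the cache on a miss.
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db[key]
        self.cache[key] = value
        return value

    def publish(self, key, value):
        # Background path: run by an offline job, never by client requests.
        self.db[key] = value
        self.cache[key] = value
```

If `publish` runs before the clients start asking for the key, `db_reads` stays at zero no matter how many clients read it.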

(Sorry if I'm speaking out of line or as an uneducated FB user, but according to these comments and RJ's breakdown, this is how it appears to me.)