|
|
|
|
|
by peterwwillis
5743 days ago
|
|
The second strangest part of this outage (to me) was that the cluster "was quickly overwhelmed by hundreds of thousands of queries a second." Does Facebook not have a way to curtail the number of queries being sent to its databases? I'm not a highly experienced programmer or dba but i've seen mod_perl websites that can do this with their API layer. It was engineered in once it was realized the database cluster could only do so many queries and connections at a time and they didn't want to lock up the database servers. The strangest thing to me was that the clients had the permission to essentially cause a race condition across the whole site. From what I understand, the client ran an API call which either forced a re-fetch of a key which apparently only needed to be fetched once (and thus theoretically could have been staged in advance using a database application not running on the frontend site which could update the cache and prevent a database fetch, or just update the database, whichever was more necessary), or failing the database connection due to the aforementioned thundering herd of QPS it also triggered a re-fetch from the db (which again could have been prevented by a db app pre-loading the new value). So, if my outsider's idea of how Facebook's code works is accurate, this could have been prevented if either the cache/database was "pre-fetched" in the background not using client API calls, or if client API calls simply weren't allowed to all modify [and read] from a single key in the database. The latter point seems less likely than the former, but possible. (Sorry if i'm speaking out of line or as an uneducated FB user, but according to these comments and RJ's breakdown this is how it appears to me) |
|