Hacker News
by Kalium 842 days ago
At the scale of Meta, "down" is a nuanced concept. You are very unlikely to get every piece of functionality seizing up at once. What you are likely to get is some services ceasing to function and other services doing error-handling.

For example, if the service that authenticates a user stops working but the service that shows the login form works, then you get a complex interaction. The resulting messaging - and thus the user experience - depends entirely on how the login page service was coded to handle whatever failure the authentication service offered up. If, from the perspective of the login form service, that failure happens to be indistinguishable from a failure to authenticate due to incorrect credentials, well, here we are.
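
A minimal sketch of how a login form service could keep those two cases apart, assuming a hypothetical authentication service that answers over HTTP with a JSON body carrying an `authenticated` flag. The status codes, the field name, and `classify_auth_response` are illustrative assumptions, not anything Meta is known to use:

```python
from enum import Enum
from typing import Optional


class AuthOutcome(Enum):
    OK = "ok"
    BAD_CREDENTIALS = "bad_credentials"
    SERVICE_FAILURE = "service_failure"


def classify_auth_response(status: Optional[int], body: Optional[dict]) -> AuthOutcome:
    """Classify a reply from the (hypothetical) authentication service.

    `status` is None when the call timed out or the connection failed.
    """
    if status is None:
        # No reply at all: this says nothing about the user's password.
        return AuthOutcome.SERVICE_FAILURE
    if status == 200 and isinstance(body, dict) and body.get("authenticated") is True:
        return AuthOutcome.OK
    if status == 401:
        # Assumed convention: the service rejects bad credentials with 401.
        return AuthOutcome.BAD_CREDENTIALS
    # 5xx, unexpected status codes, or a 200 with a nonsense body.
    return AuthOutcome.SERVICE_FAILURE
```

The point of the design is the final fallthrough: anything the caller does not positively recognize is treated as a service failure, never as a wrong password.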

At Meta's scale, there are likely quite a few underlying services, which means what we see could be a dozen or more complex interactions away from wherever the failures are actually happening.

3 comments

Isn't this just the standard problem of reporting useful error messages? Like, yes, there are academic situations where you can't distinguish between two possible error sources, but the vast majority of insufficiently informative error messages in the real world arise because little effort went into writing them.
Yes and no.

Yes, with sheer scale, a vast number of services, multiple layers, and the difficulty of defining "down" added in. I think the difficulty of reporting useful error messages is proportional to the number of places an error can reasonably happen and the number of connections it can happen over, and by any metric Meta's got a lot of those.

No, in that detecting when you should be reporting a useful error message is itself a complex problem. If a service you call gives you a nonsense response, what do you surface to the user? If a service times out, what do you report? How do you do all this without confusing, intimidating, and terrifying users to whom the phrase "service timeout" is technobabble?

> If a service you call gives you a nonsense response, what do you surface to the user?

If this occurred during the authentication process, I think I would tell the user "Sorry, the authentication process isn't working. Try again later." rather than "Invalid credentials". And you could include a "[technical details]" button that the user could click if they were curious or were in the process of troubleshooting.
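
A rough sketch of that idea, assuming the login page already knows which kind of failure it hit. The failure-kind strings, `UserMessage`, and `login_message` are hypothetical names made up for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserMessage:
    text: str                                # always shown to the user
    technical_details: Optional[str] = None  # shown only behind a "[technical details]" button


# Assumed failure kinds that mean "the auth backend misbehaved".
BACKEND_FAILURES = ("timeout", "malformed_response", "server_error")


def login_message(failure_kind: str, detail: str = "") -> UserMessage:
    """Map a failure kind to plain-language text plus optional details."""
    if failure_kind == "bad_credentials":
        return UserMessage("Invalid credentials.")
    if failure_kind in BACKEND_FAILURES:
        technical = f"{failure_kind}: {detail}" if detail else failure_kind
        return UserMessage(
            "Sorry, the authentication process isn't working. Try again later.",
            technical_details=technical,
        )
    # Unknown kinds also get a non-accusatory message, not "Invalid credentials".
    return UserMessage(
        "Sorry, something went wrong. Try again later.",
        technical_details=f"unrecognized failure kind: {failure_kind}",
    )
```

Something like `login_message("timeout", "auth backend did not reply within 2s")` would put the friendly sentence in `text` and keep the technobabble in `technical_details` for anyone who clicks through.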

Slightly unrelated question, but just how "Big" is Meta? I know it's vast, but as an outsider I have trouble grokking the scale of it.
Where most people talk about serving thousands or maybe millions of requests per second, Meta talks about billions of requests per second.

https://read.engineerscodex.com/p/how-facebook-scaled-memcac...

> If that happens to be indistinguishable from a failure to authenticate due to incorrect credentials from the perspective of the login form service, well, here we are.

If you can't distinguish those, then that is bad software design.