Hacker News
by str33t_punk 2651 days ago
Both companies are massive and have tons of developers. It becomes almost impossible to look at the system as a whole given the number of changes coming through, and you get scenarios where small failures cascade through the stack, wreaking havoc. Oftentimes it's just one config change.

It's telling that one of the hottest areas of distributed systems research these days is the boring topic of configuration management. Google, Microsoft, etc. are paying researchers top dollar to figure out how to prevent massive outages through novel techniques. It is one of the harder problems to solve and requires massive investment in tooling, refactoring, etc.

2 comments

You’re undeniably right that Facebook or Google can't be looked at as one whole system, but there has also been what seems like an unprecedented number of strange little outages (see the ones mentioned in https://news.ycombinator.com/item?id=19382418) at companies that aren’t huge. My workplace had some of its own today, and I haven’t heard an incident report about them (it’s a pretty large company and I’m not in IT).

> Google, Microsoft, etc. are paying researchers top dollar to figure out how to prevent massive outages through novel techniques

Curious what makes you think this. Are there specific job postings at either company that focus on this?

I work for Microsoft; I know of at least CrystalNet [1].

[1] - https://www.microsoft.com/en-us/research/blog/eliminating-ne...