| HN Mirror



	by str33t_punk 2651 days ago
	Both companies are massive and have tons of developers. It becomes almost impossible to look at the system as a whole with the number of changes coming through. And you get scenarios where small failures cascade through the stack, wreaking havoc. Oftentimes it's just one config change. It's telling that one of the hottest areas of distributed systems research these days is the boring topic of configuration management. Google, Microsoft, etc are paying researchers top dollar to figure out how to prevent massive outages through novel techniques. It is one of the harder problems to solve and requires massive investment in tooling, refactoring, etc.

2 comments

snazz 2651 days ago

You’re undeniably right about not looking at Facebook or Google as one whole system, but there also seems to have been an unprecedented number of strange little outages (see the ones mentioned by https://news.ycombinator.com/item?id=19382418) at places that aren’t huge companies. My workplace had some of its own today that I haven’t heard an incident report about (it’s a pretty large company and I’m not in IT).


hideo 2651 days ago

>>Google, Microsoft, etc are paying researchers top dollar to figure out how to prevent massive outages through novel techniques

Curious what makes you think this. Are there specific job postings in either company that are focused on this?


bpye 2650 days ago

I work for Microsoft; I know of at least CrystalNet [1].

[1] - https://www.microsoft.com/en-us/research/blog/eliminating-ne...
