Hacker News
by blackaspen 1315 days ago
I was an SRE there (even senior SRE!) -- left in 2018 but have still been pretty in-the-loop for the past few years due to friends there. The question was, how long would things stay up if everything stopped: the answer was <1 week for everything (including ads, things you don't see, etc). Maybe 2 or 3 weeks for most core functionality.

With some of their service issues in the past few weeks (DC stuff that's been publicized), it could be less. To the very best of my knowledge no substantial resiliency work has been done recently.

I do know that a good number of folks on-call for core services were laid off while on-call, so, that bodes well. I feel bad for everyone left trying to keep things running.

That's very interesting that there isn't enough resiliency baked in to keep it going forever.
In something of this scale, "enough resiliency baked in to keep it going forever" is not possible. There are many reasons for this...

One is that hardware fails and needs to be replaced. That requires people who know how to install the replacement hardware and deploy to it. That's assuming the new hardware is 100% compatible - that won't be the case for more than a handful of years.

Another is currently unknown security vulnerabilities, whether in their own code or in external packages they use. Those vulnerabilities are there and they will be discovered. Once they are, things start being taken down from the outside until the system collapses.

Yet another is bugs. Every system of this scale has a large number of bugs, many of them unknown. Some of those won't be discovered until the right conditions arise - the right combination of data, timing, etc. When they are finally triggered, some of those bugs will take down entire subsystems, some of which are critical to the product functioning.

There are many more examples like this. There is no such thing as indefinite resiliency for anything near this scale.

I have a Sun Solaris box in my office that was powered up in 1998 and has faithfully served NIS/YP without a hardware or software fault since that day.

The modern version of this (Kubernetes + AWS/GCP), if designed well, could likely continue to run for a long, long time. Especially for a product as simple as Twitter.

Congratulations, but that is unrelated to what Twitter is doing. How would your Solaris box hold up to half a billion tweets a day distributed in near real time across a user graph with 100M nodes, all while storing those tweets durably and allowing users to search and retrieve a long history of them? It's not simple at all.
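For a rough sense of scale, here is a back-of-the-envelope sketch. Only the half-a-billion-tweets-a-day figure comes from the comment above; the average follower count is a made-up placeholder, and real traffic is bursty, so treat the output as an order-of-magnitude average, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(events_per_day: float) -> float:
    """Average event rate per second, assuming traffic is spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def fanout_deliveries_per_second(tweets_per_day: float, avg_followers: float) -> float:
    """Timeline deliveries per second if every tweet is delivered to each follower."""
    return average_per_second(tweets_per_day) * avg_followers


tweets_per_day = 500_000_000        # "half a billion tweets a day" from the comment
hypothetical_avg_followers = 100    # assumption for illustration only, not a real statistic

print(f"{average_per_second(tweets_per_day):,.0f} tweets/s on average")
print(
    f"{fanout_deliveries_per_second(tweets_per_day, hypothetical_avg_followers):,.0f} "
    "timeline deliveries/s at the assumed follower count"
)
```

That works out to a few thousand writes per second on average before any fan-out, search indexing, or durable storage is counted.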

Unlike your Solaris box, they are the target of constant advanced hacking attempts. I've been a part of the response when AWS was doing urgent work because of a security incident. The company I worked at was large enough to be paying AWS over $1M a month when one such incident required dozens of our engineers working around the clock for three days to deal with AWS's response. We weren't even directly involved in the security issue. But without that engineering effort, our product would have shut down. There were other security incidents we were directly involved in and those would have taken us down without an even bigger response (whether or not we were running in AWS).

And then there are hardware failure rates. Hard drives alone fail at a rate of 1-2% per year[0]. Not a big deal on a single box. A very big deal when you have many thousands of hard drives - multiple drives fail every day. Unless you want to WAY over-allocate storage for redundancy. Even with that, there are surprising vulnerabilities to hardware failure at this scale.
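To put rough numbers on that, here is a quick sketch. Only the 1-2% annual failure rate comes from the figure cited above; the fleet sizes are arbitrary examples, and failures are assumed to be independent and spread evenly across the year, which real fleets only approximate.

```python
DAYS_PER_YEAR = 365


def expected_daily_failures(drive_count: int, annual_failure_rate: float) -> float:
    """Expected drive failures per day, assuming failures are spread evenly over the year."""
    return drive_count * annual_failure_rate / DAYS_PER_YEAR


def drives_for_one_failure_per_day(annual_failure_rate: float) -> float:
    """Fleet size at which you expect one failed drive per day on average."""
    return DAYS_PER_YEAR / annual_failure_rate


# Fleet sizes below are illustrative assumptions, not figures for any real company.
for drive_count in (1, 10_000, 100_000):
    for afr in (0.01, 0.02):
        per_day = expected_daily_failures(drive_count, afr)
        print(f"{drive_count:>7,} drives at {afr:.0%} AFR: {per_day:.4f} failures/day")

for afr in (0.01, 0.02):
    print(f"{afr:.0%} AFR: one failure/day at about {drives_for_one_failure_per_day(afr):,.0f} drives")
```

On a single box the expected rate is effectively zero on any given day; somewhere in the tens of thousands of drives it becomes a daily event.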

----

[0] https://www.backblaze.com/b2/hard-drive-test-data.html

But hard drive failures are why you pay a cloud company with live migrate (i.e. not AWS) for their service. The physical hardware the machine is running on will eventually fail, as you note, but the VM will keep on ticking basically forever * and you'd never know the hard drive/SSD underneath it failed.

* Live migrate won't upgrade the CPU family you're running on, so eventually someone or something on your end will be forced to deal with migrating it, but that's O(years).