by rented_mule 1322 days ago
Congratulations, but that is unrelated to what Twitter is doing. How would your Solaris box hold up to half a billion tweets a day distributed in near real time across a user graph with 100M nodes, all while storing those tweets durably and allowing users to search and retrieve a long history of them? It's not simple at all.

Unlike your Solaris box, they are the target of constant advanced hacking attempts. I've been a part of the response when AWS was doing urgent work because of a security incident. The company I worked at was large enough to be paying AWS over $1M a month when one such incident required dozens of our engineers working around the clock for three days to deal with AWS's response. We weren't even directly involved in the security issue. But without that engineering effort, our product would have shut down. There were other security incidents we were directly involved in, and those would have taken us down without an even bigger response (whether or not we were running in AWS).

And then there are hardware failure rates. Hard drives alone fail at a rate of 1-2% per year[0]. That's not a big deal on a single box, but it's a very big deal when you have many thousands of hard drives: multiple drives fail every day, and someone has to deal with each one unless you want to WAY over-allocate storage for redundancy. Even with that, there are surprising vulnerabilities to hardware failure at this scale.
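
To put rough numbers on that, here's a back-of-the-envelope sketch. The fleet sizes are made-up assumptions for illustration (not Twitter's actual numbers), and it assumes failures are independent and spread evenly over the year, which real fleets don't quite obey.

```python
def expected_daily_failures(drive_count: int, annual_failure_rate: float) -> float:
    """Expected drive failures per day for a fleet.

    Assumes failures are independent and spread evenly across 365 days.
    """
    return drive_count * annual_failure_rate / 365.0


if __name__ == "__main__":
    # Hypothetical fleet sizes; the 1-2% range is the annual rate cited above.
    for drives in (1, 10_000, 100_000):
        for afr in (0.01, 0.02):
            per_day = expected_daily_failures(drives, afr)
            print(f"{drives:>7} drives at {afr:.0%}/year: {per_day:.3f} failures/day")
```

Under those assumptions, a single box sees a failure once every 50-100 years, a 10,000-drive fleet loses a drive every two to four days, and a 100,000-drive fleet loses roughly three to five drives every day.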

----

[0] https://www.backblaze.com/b2/hard-drive-test-data.html

1 comment

But hard drive failures are why you pay a cloud company with live migration (i.e., not AWS) for their service. The physical hardware the machine is running on will eventually fail, as you note, but the VM will keep on ticking basically forever* and you'd never know the hard drive/SSD underneath it failed.

* Live migration won't upgrade the CPU family you're running on, so eventually someone (or something) on your end will be forced to deal with migrating it, but that's O(years).