How is it that Amazon.com is so reliable if there are so many problems with their "cloud" products? Do they not use the same software to run their site?
If you understand the limitations of the various products you can build a VERY reliable service. Reddit's reliance on a single datacenter and a single storage technology for its data was an engineering failure. They essentially didn't have a disaster recovery plan in place.
I'm sure reddit's engineers are as capable as any of producing a seamless disaster recovery plan, but the most common obstacle to implementing one is cost. Most web services accept the occasional risk of downtime in one data center rather than incur the cost of being in two data centers at all times.
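To put rough numbers on that tradeoff, here is a minimal sketch in Python. It assumes each data center fails independently and that the service stays up as long as any one site is up; real outages are often correlated, so treat the result as an upper bound. The function name and the 99% figure are made up for illustration, not anything reddit or Amazon has published.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Availability of a service that is up when at least one site is up.

    Assumes site failures are independent (an optimistic simplification).
    """
    # The service is down only when every site is down at the same time.
    all_down = (1 - per_site) ** sites
    return 1 - all_down


# Hypothetical per-site availability of 99%.
one_site = combined_availability(0.99, 1)
two_sites = combined_availability(0.99, 2)
print(f"one site:  {one_site:.4%}")
print(f"two sites: {two_sites:.4%}")
```

Under those assumptions a second site takes you from roughly two nines to roughly four, but you pay for a second data center the whole time to get there.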
Yep. And there's that whole asymptotic cost/complexity curve where, as you chase more 9s of perfection, your cost and complexity rise out of proportion to the value you're getting. At the end of the day, no matter how much we might like Reddit, it's still just a website with social discussion forums and link sharing, full of non-essential chatter and pictures of kitties. (Again, I love Reddit, don't get me wrong, but it's far from a Mission Critical resource for any business's or person's life.) So achieving perfect reliability & performance is probably not worth the cost/pain.
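For a sense of scale on the 9s, this quick Python sketch converts a number of nines into allowed downtime per year. It only shows the downtime side; the cost side of the curve is not modeled here, and the function name is just illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year at an availability of `nines` nines.

    E.g. nines=3 means 99.9% availability, i.e. 0.1% unavailability.
    """
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):10.2f} minutes/year")
```

Each extra nine cuts the downtime budget by a factor of ten: about 3.65 days a year at two nines, under 9 hours at three, under an hour at four. The remaining outages get harder and more expensive to eliminate as the budget shrinks.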
I suspect it's because amazon.com has different performance requirements. For instance, I imagine the read/write balance is very different for amazon.com than for reddit.com.
If I ran the tech at Amazon, I'd want to reuse as much internal tooling, software architecture, and best practices between EC2 and core Amazon.com as possible, but keep physically separate machines and network zones. Maybe share some of the same data centers, of course, but that's as far as I'd take it, and even that sounds a little risky.