by mrmondo 3671 days ago
Yeah, we monitor lots of Amazon & Microsoft 'cloud' services, and we observe much, much higher downtime / number of outages than they ever report, on the order of 50 to 1 or more. What do you expect though, both companies are known for lying through their teeth to convince the IT community (or more likely the IT managers) that their services are reliable for everyone and have amazing uptime, and that they're not only a good option but the only option.
4 comments

> What do you expect though, both companies are known for lying through their teeth to convince the IT community (or more likely the IT managers) that their services are reliable for everyone and have amazing uptime

Their uptime is much higher on average than any IT team I've ever been involved in.

Oh wow really? That's really bad - you must have worked with some really poor ops teams in the past. Last year we measured less than 97% uptime on AWS Sydney, and a shocking 96% uptime for Office 365 Exchange Online. When we investigated out of interest, most of the problems were due to either network routing issues within their networks (or at the first ISP hop), or hosts that just outright failed. The 'cloud' is just outsourced hardware with a provided toolset (APIs etc.). Amazon itself claims that you must have your hosts across various zones to get decent uptime - that's like saying "oh yes - the Toyota Corolla is really reliable, it works 99.99% of the time... As long as you buy a second one for when it's not available".

Our internal uptime is 99.985% in production. We are fast moving and roll out changes every day, we run mainline kernels, and all of our 350-odd servers and ~800 containers run on completely vendor-independent, open source software.
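For scale, here is a rough sketch of what those percentages mean in hours of downtime per year. The figures are the ones quoted in this thread; "less than 97%" is treated as exactly 97%, so that number is a lower bound on the downtime, and the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Convert an uptime percentage into expected hours of downtime per year."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


for label, pct in [
    ("AWS Sydney (measured)", 97.0),
    ("Exchange Online (measured)", 96.0),
    ("internal production", 99.985),
]:
    print(f"{label}: {downtime_hours_per_year(pct):.1f} h/year")
```

That works out to roughly 263 hours, 350 hours and 1.3 hours a year respectively.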

I'm not saying it's easy, but the middle man is there either to help you if you can't find or afford good operational engineers up front, or to take your money because their advertising has made you believe that they are always the best decision.

We perform an in-detail yearly cross-cost comparison between AWS and our operated datacentre, covering the cost to run and maintain the same uptime, processing power (and yes, we take into account spinning down instances at night etc.), bandwidth between zones, backups and customers, and it really hasn't improved at all over the past 3 years. This year the review came back that our yearly expenditure on operational expenses would increase from approximately $500,000 (including human resources) to well over $3,000,000 a year (not kidding). The margin of error was approximated at between 10-20%.

> Oh wow really? That's really bad - you must have worked with some really poor ops teams in the past.

You sound genuinely very smart and knowledgeable in this area. But the other 90% of the workers in this sector are not.

> Amazon itself claims that you must have your hosts across various zones to get decent uptime - that's like saying "oh yes - the Toyota Corolla is really reliable, it works 99.99% of the time... As long as you buy a second one for when it's not available".

Wait, you don't have a second data center for your mission critical systems in case your primary fails?

> We perform an in-detail yearly cross-cost comparison between AWS and our operated datacentre...bandwidth between zones, backups and customers and it really hasn't improved at all over the past 3 years

I totally agree. If you have the right resources, a good data center partner and well-defined processes, then "the cloud" isn't for you. For the other 90% of the people out there who simply don't have the know-how or resources to find talented IT operations people, AWS totally makes sense.

Yes we have two datacentres and we do have a few VPS mostly for triangulation of monitoring, but honestly, in four years - we haven't had to failover once, although we practise it with our applications almost every single day.

Thank you for the kind words there. I think one major thing for us is that we've hired a small number of just the right people, each with quite different backgrounds, and we work VERY closely with our developers. Every bit of configuration is kept in Git and we CI / CD whatever we can.

> Yes we have two datacentres

That's all that Multi-AZ is mate ;)
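A back-of-the-envelope sketch of why a second site matters, whether it's a second datacentre or a second AZ. It assumes the two sites fail independently and that failover is instant, which real deployments only approximate; the 97% input is the single-region figure quoted earlier in the thread, and the function name is illustrative.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    Assumes failures are independent and failover costs no time.
    """
    return 1 - (1 - per_site) ** sites


single = 0.97  # single-site availability, taken from the figure quoted above
print(f"one site:  {single:.4%}")
print(f"two sites: {combined_availability(single, 2):.4%}")
```

Under those assumptions two 97% sites come out at roughly 99.91% combined. Correlated failures (shared network, shared control plane) would pull that number down.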

Those icons don't change unless a certain percentage of the overall count of instances in an AZ or region are affected.

Most of the time people here might be seeing good portions of their infra go away, but the number isn't statistically significant to the overall region health for them to post an outage.

Don't ask me what those numbers are, but that is the way it is determined.
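To illustrate the idea only: a minimal sketch of a threshold-based status check. The 5% cut-off and the function name are made up for the example; as said above, the real numbers aren't known here.

```python
def region_status(affected: int, total: int, threshold: float = 0.05) -> str:
    """Return a dashboard colour for a region.

    `threshold` is a hypothetical placeholder, not a real provider value.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return "yellow" if affected / total >= threshold else "green"


# A customer losing all 40 of their own instances barely moves the regional fraction.
print(region_status(affected=40, total=100_000))     # 0.04% of the region affected
print(region_status(affected=6_000, total=100_000))  # 6% of the region affected
```

With a rule like this, a customer can lose their whole fleet while the icon stays green.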

Sounds interesting. Is that data available?
From my experience in AWS, part of the problem is scope of impact. It's easy to lose track of just how many active customers there are at any time, and it's easy to see the platform as a cohesive whole, i.e. "If it's affecting you it must be affecting everyone else". In reality almost every customer-impacting event affects only a tiny percentage of the active users at any one time. I know it can be hard to believe or see this as an external customer, because after all the service appears to be down to you.

Take, for example, when people start saying "us-east-1a" is down. What is "us-east-1a"? If you've watched some of the re:Invent talks you'll know that it actually describes numerous data centres in close proximity (within a certain millisecond network target). If one of those has an incident, it might look to some customers like "us-east-1a" is down, when the reality might be that 95%+ of the data centres are still fully functional, and most customers aren't seeing an impact.

You might have an incident affecting just 2% of the API calls, and affecting less than 2% of the user base (even that would be unusually large and a source of big drama internally). The service could be super stable and extremely reliable for the other 98%, but they could get completely the wrong idea if they saw a degraded service status (and of course, from a PR perspective, the same goes for anyone looking to use the platform).

A service dashboard is an extremely blunt tool with which to pass out a message about service status. It renders what is an extremely nuanced situation down to "All good, maybe, no, DEAD"

To give a rough example, one service I was familiar with had a "page everyone in the team" level of incident. API availability tanked, badly. It looked atrocious, and seemed like hardly any requests were getting through successfully. You'd have every expectation that they should at least post a yellow alert, if not approaching red. It turned out that it was one single customer whose requests were failing (I forget the reason why), but due to a bug in the customer's software consuming the API, every time it got a 500 response it would immediately resend the request, every single time, with no timeout or retry limit. It reached such a terrific pace that they made up a huge majority of all the requests hitting the endpoint. Every other customer using the service was completely fine. If you'd looked at the API graphs you'd think "POST YELLOW, POST YELLOW, NOW NOW NOW!", but because they took time to figure out the actual impact, they found out that would have been totally the wrong thing to do.
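For contrast with that client bug, here is a minimal sketch of a retry loop with a capped attempt count and exponential backoff with jitter. It is not what that customer ran or what any AWS SDK does; `send` is a hypothetical callable that performs one request and returns the HTTP status code, and the default delays are arbitrary.

```python
import random
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-5xx status or attempts run out.

    Returns the last status code seen.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if status < 500:
            break
        # Full jitter: wait a random time up to the capped exponential delay.
        delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
        sleep(delay)
        status = send()
    return status
```

The `sleep` parameter is only there so the waiting can be swapped out in tests. The important parts are the attempt limit and the growing delay, either of which would have stopped the flood described above.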

Service health dashboards are a neat idea, but one that is in desperate need of a rethink and overhaul. They have some value when you're a smaller service, but they just don't scale accurately with the platform.

I'm not sure what the real solution is. They've somehow got to pull together terabytes of logs and/or metrics to make an accurate assessment of the scenario, and do it in a matter of minutes, so as to provide accurate updates and not needlessly panic customers.
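As a toy version of that assessment, here is a sketch that separates "share of requests failing" from "share of customers affected". The record shape (customer id, HTTP status) and the function name are assumptions for illustration; a real pipeline would be streaming this from logs at vastly larger scale.

```python
from typing import Dict, Iterable, Tuple


def impact_summary(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Summarise (customer_id, http_status) records.

    Returns the share of requests that failed with a 5xx and the share of
    customers that saw at least one such failure.
    """
    total = failed = 0
    customers = set()
    affected = set()
    for customer_id, status in records:
        total += 1
        customers.add(customer_id)
        if status >= 500:
            failed += 1
            affected.add(customer_id)
    return {
        "request_error_rate": failed / total if total else 0.0,
        "customer_impact_rate": len(affected) / len(customers) if customers else 0.0,
    }


# One looping client sends 9,000 failing requests; 99 others send 10 good ones each.
records = [("noisy", 500)] * 9_000 + [
    (f"c{i}", 200) for i in range(99) for _ in range(10)
]
print(impact_summary(records))
```

In this made-up data about 90% of requests fail while only 1% of customers are affected, which is the same shape as the incident described earlier: the request graph looks terrible, the customer view looks fine.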