by k1w1 4058 days ago
As an AWS user this type of thing gives me cause for concern:

At 2015-04-01 00:00 UTC, the Amazon EC2 "provisioned I/O" volume on which most of this metadata was stored suddenly changed from an average latency of 1.2 ms per request to an average latency of 2.2 ms per request. I have no idea why this happened -- indeed, I was so surprised by it that I didn't believe Amazon's monitoring systems at first -- but this immediately resulted in the service being I/O limited.

A sudden near-doubling of latency can have dire consequences for any system. Knowing that such unexpected changes are possible makes it hard to build trust in your environment, even if it is running fine today.
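To see why a jump from 1.2 ms to 2.2 ms can make a service I/O limited, here is a rough back-of-the-envelope sketch. It assumes requests are issued one at a time (queue depth 1), which the quoted post mortem does not state; `max_iops` and `queue_depth` are hypothetical names for illustration only.

```python
def max_iops(latency_ms: float, queue_depth: int = 1) -> float:
    """Upper bound on requests per second: concurrency divided by latency."""
    return queue_depth * 1000.0 / latency_ms

# Latencies quoted in the post mortem; queue depth 1 is an assumption.
before = max_iops(1.2)
after = max_iops(2.2)
drop = 1 - after / before

print(f"before: {before:.0f} req/s, after: {after:.0f} req/s, drop: {drop:.0%}")
# prints "before: 833 req/s, after: 455 req/s, drop: 45%"
```

Under that assumption, a workload that needed more than about 455 serial requests per second would go from comfortable to saturated overnight, with no change on the application side.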


It's getting to the point where, when I see a post mortem like this, I am just waiting for the AWS problems. Between this and the downtime that AWS has, I'm kind of amazed that people use it-- you pay too much and you get less. (Compared to a lot of other choices, such as raw metal boxes from Hetzner)

This is why I don't use AWS for anything non-trivial, and I am wary of people who put critical infrastructure on it. (E.g., I don't care about Netflix, that service can run on AWS fine, but Coinbase, for instance: if I were their customer and they ran on AWS, I would stop being their customer.)

Whenever AWS problems come up people talk about how "AWS is so much more efficient, you just outsource that stuff to the experts".

But that seems to imply that hosting on your own hardware in your own office is the only alternative. Of course we stopped doing that in the 1990s.

With AWS you have to know Linux and have ops people; that's true everywhere. With AWS you have the additional burden of learning the AWS APIs and learning how to use AWS, which isn't transferable, so that's a higher cost. With AWS you have to architect around the limitations of the way AWS is built, and your architecture becomes AWS-specific if you use those APIs, so that's an additional cost. You don't need any fewer ops people, probably more, than going with another hosting service like DigitalOcean or Rackspace. And if you go with something like Hetzner you pay 1/5th to 1/10th for machines with a lot more performance and local storage. (Though you get the additional latency of being located in Europe, if your primary customers are in the USA.)

Of course, I'm also prejudiced. I worked at Amazon and saw how the sausage was made and was not impressed. When AWS was announced as "running on the same infrastructure that powers Amazon.com!!!" as if it was a feature, I cringed. Amazon.com was having outages of parts or major components on a weekly basis at that time. Much of AWS is actually running on bespoke software (so not actually tested by Amazon.com when introduced, though I'm sure portions have been moved over at gunpoint) ... which actually makes it worse. People were trusting their data to a service that pretended to be backing a major e-commerce site but was actually untested outside of the company at the time.

And what have we seen since? An unacceptable level of failures. (in my opinion, of course)

But people seem to be very forgiving. When it's happening everyone's in "how can we fix this" mode, and then when it's fixed everyone forgets and goes back to thinking of AWS as always running.

To this day I still do not get why you would use AWS: the entire user experience is clunky and the pricing is crazy for what you get. Azure isn't much better with regard to downtime, but if you want something more than just a VPS I'd choose it any day over AWS for the significantly better UX in both the admin console and the command-line tools and SDK.

Ultimately though, even with Azure or AWS you're going to need people knowledgeable enough to administer your compute instances anyway, so why not just run your full stack on a bunch of VMs from DigitalOcean or Linode, or rent a couple of dedicated servers and throw oVirt on them, saving yourself a significant chunk of money at the same time?

Indeed, I didn't know such a change was possible -- that EBS volume went for years with consistent low latency before it suddenly slowed down.
You could have contacted AWS support or emailed me. Either way, we would have investigated.
It wasn't missing its guaranteed # of I/Os per second, so I figured the slowdown was just "one of those things" and not an out-of-spec issue. Happy to send you the volume ID if you think someone would want to investigate (and still has data from the start of April) though.
Yes, please do.
DevOps/Infrastructure engineer here! I see this happen frequently in AWS. Never expect either your instance networking latency or the latency of the underlying EBS storage layer to be consistent.

If you absolutely need guaranteed IO performance, use an instance store or move to dedicated hardware. Them be the breaks of cloud computing.

http://en.wikipedia.org/wiki/Fallacies_of_distributed_comput...
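If you can't rely on consistent latency, one option is to at least notice when it shifts. Below is a minimal sketch, not anything AWS provides: `latency_shifted` and `ratio_threshold` are hypothetical names, the sample values are made up to mirror the 1.2 ms to 2.2 ms change discussed above, and collecting the samples is left to whatever monitoring you already have.

```python
from statistics import median

def latency_shifted(baseline_ms, recent_ms, ratio_threshold=1.5):
    """Return True if the recent median latency is at least
    ratio_threshold times the baseline median."""
    return median(recent_ms) >= ratio_threshold * median(baseline_ms)

# Illustrative samples only (milliseconds per request).
baseline = [1.1, 1.2, 1.3, 1.2, 1.2]
recent = [2.1, 2.2, 2.3, 2.2, 2.4]

print(latency_shifted(baseline, recent))  # prints "True"
```

Comparing medians rather than means keeps a single slow request from triggering the check; the 1.5x threshold is arbitrary and would need tuning per workload.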