Hacker News new | ask | show | jobs
by tw04 1964 days ago
>I don't mean this ironically, but I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck.

Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?

If you're saying now the story is: well rely on them to auto scale, until they don't - then why would I bother? Now you're telling me I need to go back to having infrastructure experts, which means I can save TON of money by going with a hosting provider that allows allocation of resources via API (which is basically all of them).

5 comments

No, the cloud provides scalable infrastructure, but once you are in the 0.01% and you have very unique usage patterns, you still need to know how to set up your infrastructure for your needs. The difference is that instead of writing and managing a scalable cache, you just need to build the layer that knows to pre-provision for that scale/talk with AWS to make sure the system has sufficient capacity.

The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.

This feels like AWS apologism.

Slack knew how to set up their infrastructure. Nothing in the postmortem implies AWS was misconfigured. AWS spotted the problem and fixed it entirely on their side.

Nothing in this report suggests that Slack has unique usage patterns. Users returning to work after Christmas is not a phenomenon unique to Slack.

Their problems were:

1. The AWS infrastructure broke due to an event as predictable as the start of the year. That's on Amazon.

2. Their infrastructure is too complicated. Their auto-scaling created chaos by shutting down machines whilst engineers were logged into them due to bad heuristics, although it's not like this was a good way to save money, and their separation of Slack into many different AWS accounts created weird bottlenecks they had no way to understand or fix.

3. They were unable to diagnose the root cause and the outage ended when AWS noticed the problem and fixed their gateway system themselves.

The cloud isn't some magic thing that solves all scaling problems

In this case it actually created scaling problems where none needed to exist. AWS is expensive compared to dedicated machines in a colo. Part of the justification for that high cost is seamless scalability and ability to 'flex'.

But Slack doesn't need the ability to flex here. Scaling down over the holidays and then back up once people returned to work just isn't that important for them - it's unlikely there were a large number of jobs queued up waiting to run on their spare hardware for a few days anyway. It just wasn't a good way to save money: a massive outage certainly cost them far more than they'll ever save.

It wasn't scaling "back up". It was a huge spike as evertone refilled cache at the same time.

It's similar to Black Friday spikes Amazon handles themselves.

> The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.

I don't think anyone who's got any reasonable level of experience is expecting that it's a magic wand.

There are, though some things in AWS (and for sure other cloud providers) where you get no useful signals or controls. It's entirely managed by the cloud provider, based on their own internal metrics and scaling behaviors.

Behind the scenes, their load balancer services don't give you indications of how heavily loaded they are - nor do you get to directly control how many/big those load balancers are.

In some parts you can hack around this by pre-warming infrastructure by generating fake traffic - but that assumes that you have those metrics and knowledge that you even need to do this.

This applies to all sorts of things - there's hidden caps and other capacity limits all over AWSs platform that you don't know about until you hit them. There's even capacity limits that you can know about, because they're publicly documented, but AWS lies and won't tell you the actual limit being applied to your account - the console and documentation says one thing, but in reality it's a lot lower.

If that capacity limit resulted in an outage, well, tough luck.

If you are serious about reliability you always need infrastructure experts.

AWS is pretty good about documenting the limits of their systems, SLAs, how to configure them, etc. They don't just say you should wave a magic wand -- and even if they did say that, professional software engineers know better.

"a hosting provider that allows allocation of resources via API" is exactly what AWS is. Your infrastructure experts come into the picture because they need to know which resources to request, how to estimate the scale they need, and how to configure them properly. They should also be doing performance testing to see if the claimed performance really holds up.

> Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?

Clearly there are limits even with the largest cloud providers. You'll have to engage a bit of critical thought in to whether you're going to get near those limits and what that might mean for your product. Obviously that's easier said than done, but you could argue that the cloud providers are still giving you reasonable value if you can pass the buck on a given issue for x years.

You have to know how to write code that fits into the cloud. You can't arbitrarily read/write to the file system, acting as if there's only one instance of the server running (if you plan to run hundreds or thousands). So even by waving the cloud 'magic wand', you still need to understand writing code in a cloud-friendly way. So in some sense, it's a shared responsibility between the vendor and engineering. You need to understand how to apply the tools being given to you.
Per the article, literally nothing in their code would have solved the issue. AWS was supposed to auto-scale TGWs and didn't.

>Our own serving systems scale quickly to meet these kinds of peaks in demand (and have always done so successfully after the holidays in previous years). However, our TGWs did not scale fast enough. During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.

Correct, I was disputing the point that you can freely code without being mindful of the architecture even though the selling point of cloud providers is "focus on code, leave architecture to us". I'm not disputing in this case AWS was at fault: as the customer, Slack did everything right.
> Isn't that literally supposed to be the sales pitch for the cloud?

Yes.

But a sales pitch is the most positive framing of the product possible. I wouldn't rely on the sales pitch when making the decision about how much you should depend on the cloud.