by dplgk 1964 days ago
> On January 4th, one of our Transit Gateways became overloaded. The TGWs are managed by AWS and are intended to scale transparently to us. However, Slack’s annual traffic pattern is a little unusual: Traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!). On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight.

What's interesting is that when this happened, some HN comments suggested it was the return from holiday traffic that caused it. Others said, "nah, don't you think they know how to handle that by now?"

Turns out Occam's razor applied here. The simplest answer was the correct one. Return-from-holiday traffic.


Though the nuance is Slack did know how to handle it, AWS didn't.
I don't mean this ironically, but I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck.

This usually works well, under the rationale that "upstream provider does this for a living, so they must be better than us at this", but if you have too unique needs (or are just a bit "unlucky"), it can fail too.

All this to say that the cloud isn't magic. From a risk/error prevention point of view, it's not that different from writing software for a single local machine: not every programmer needs to know how to manually do memory management, it makes a lot more sense to rely on your OS and malloc (and friends) for this, but the caveat is that you do need to account for the fact that malloc may fail. In the cloud case, one can't just assume that you'll always be able to provision a new instance, scale up a service, etc. The cloud is like a utility company: normally very reliable, but they do fail too.
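
To make the malloc analogy concrete, here is a minimal sketch of treating provisioning like an allocation that can fail. Everything in it is hypothetical: `CapacityError`, the `provision` callable and the halving strategy are assumptions for illustration, not any cloud provider's API.

```python
class CapacityError(Exception):
    """Raised by a (hypothetical) provisioning call when capacity is unavailable."""


def provision_or_degrade(provision, wanted, minimum):
    """Try to provision `wanted` units, halving the request down to `minimum`.

    Returns the number of units obtained, or 0 if even the smallest
    acceptable request failed, so the caller has to handle the
    "allocation failed" case explicitly instead of assuming success.
    """
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    request = wanted
    while request >= minimum:
        try:
            provision(request)  # hypothetical call into the provider
            return request
        except CapacityError:
            request //= 2  # ask for less, like retrying a smaller malloc
    return 0
```

The point is only that the failure path exists in the code; what to do when it returns 0 (shed load, serve stale data, page someone) is a product decision.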

>I don't mean this ironically, but I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck.

Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?

If you're saying now the story is: well, rely on them to auto scale, until they don't - then why would I bother? Now you're telling me I need to go back to having infrastructure experts, which means I can save a TON of money by going with a hosting provider that allows allocation of resources via API (which is basically all of them).

No, the cloud provides scalable infrastructure, but once you are in the 0.01% and you have very unique usage patterns, you still need to know how to set up your infrastructure for your needs. The difference is that instead of writing and managing a scalable cache, you just need to build the layer that knows to pre-provision for that scale/talk with AWS to make sure the system has sufficient capacity.

The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.

This feels like AWS apologism.

Slack knew how to set up their infrastructure. Nothing in the postmortem implies AWS was misconfigured. AWS spotted the problem and fixed it entirely on their side.

Nothing in this report suggests that Slack has unique usage patterns. Users returning to work after Christmas is not a phenomenon unique to Slack.

Their problems were:

1. The AWS infrastructure broke due to an event as predictable as the start of the year. That's on Amazon.

2. Their infrastructure is too complicated. Their auto-scaling created chaos by shutting down machines whilst engineers were logged into them due to bad heuristics, although it's not like this was a good way to save money, and their separation of Slack into many different AWS accounts created weird bottlenecks they had no way to understand or fix.

3. They were unable to diagnose the root cause and the outage ended when AWS noticed the problem and fixed their gateway system themselves.

> The cloud isn't some magic thing that solves all scaling problems

In this case it actually created scaling problems where none needed to exist. AWS is expensive compared to dedicated machines in a colo. Part of the justification for that high cost is seamless scalability and ability to 'flex'.

But Slack doesn't need the ability to flex here. Scaling down over the holidays and then back up once people returned to work just isn't that important for them - it's unlikely there were a large number of jobs queued up waiting to run on their spare hardware for a few days anyway. It just wasn't a good way to save money: a massive outage certainly cost them far more than they'll ever save.

It wasn't scaling "back up". It was a huge spike as everyone refilled their cache at the same time.

It's similar to Black Friday spikes Amazon handles themselves.

> The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.

I don't think anyone who's got any reasonable level of experience is expecting that it's a magic wand.

There are, though, some things in AWS (and for sure other cloud providers) where you get no useful signals or controls. It's entirely managed by the cloud provider, based on their own internal metrics and scaling behaviors.

Behind the scenes, their load balancer services don't give you indications of how heavily loaded they are - nor do you get to directly control how many/big those load balancers are.

In some parts you can hack around this by pre-warming infrastructure by generating fake traffic - but that assumes that you have those metrics and knowledge that you even need to do this.

This applies to all sorts of things - there are hidden caps and other capacity limits all over AWS's platform that you don't know about until you hit them. There are even capacity limits that you can know about, because they're publicly documented, but AWS lies and won't tell you the actual limit being applied to your account - the console and documentation say one thing, but in reality it's a lot lower.

If that capacity limit resulted in an outage, well, tough luck.

If you are serious about reliability you always need infrastructure experts.

AWS is pretty good about documenting the limits of their systems, SLAs, how to configure them, etc. They don't just say you should wave a magic wand -- and even if they did say that, professional software engineers know better.

"a hosting provider that allows allocation of resources via API" is exactly what AWS is. Your infrastructure experts come into the picture because they need to know which resources to request, how to estimate the scale they need, and how to configure them properly. They should also be doing performance testing to see if the claimed performance really holds up.

> Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?

Clearly there are limits even with the largest cloud providers. You'll have to engage a bit of critical thought into whether you're going to get near those limits and what that might mean for your product. Obviously that's easier said than done, but you could argue that the cloud providers are still giving you reasonable value if you can pass the buck on a given issue for x years.

You have to know how to write code that fits into the cloud. You can't arbitrarily read/write to the file system, acting as if there's only one instance of the server running (if you plan to run hundreds or thousands). So even by waving the cloud 'magic wand', you still need to understand writing code in a cloud-friendly way. So in some sense, it's a shared responsibility between the vendor and engineering. You need to understand how to apply the tools being given to you.
Per the article, literally nothing in their code would have solved the issue. AWS was supposed to auto-scale TGWs and didn't.

>Our own serving systems scale quickly to meet these kinds of peaks in demand (and have always done so successfully after the holidays in previous years). However, our TGWs did not scale fast enough. During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.

Correct, I was disputing the point that you can freely code without being mindful of the architecture even though the selling point of cloud providers is "focus on code, leave architecture to us". I'm not disputing in this case AWS was at fault: as the customer, Slack did everything right.
> Isn't that literally supposed to be the sales pitch for the cloud?

Yes.

But a sales pitch is the most positive framing of the product possible. I wouldn't rely on the sales pitch when making the decision about how much you should depend on the cloud.

>This usually works well, under the rationale that "upstream provider does this for a living, so they must be better than us at this", but if you have too unique needs (or are just a bit "unlucky"), it can fail too.

Heh, a while ago I joked that one way to scale is to "make it somebody else's problem", with the proviso that you need to make sure that the someone else can handle the load. And then (due to the context) a commenter balked at the idea that a big player like YouTube would be unable to handle the scaling of their core business.

https://news.ycombinator.com/item?id=23170685

(If they're really blaming it on AWS, it really takes guts to do it so publicly, I think.)

> I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck

The issue was a transit gateway, a core network component. If they weren't in the cloud, this would have been a router, so they "outsourced" it in the same way an on-prem service outsources routing to Cisco. I guess the difference is they might have had better visibility into the Cisco router and known it was overloaded.

I don't think that's true. Slack seems to have their core online services split across a number of VPCs, and for some reason decided to use Transit Gateway to connect them. Transit Gateway is a special-purpose solution that is geared toward cross-region and on-prem to VPC connections in corporate networks, not to global high-traffic consumer products. It's the wrong tool for the job. Its architecture is antithetical to the other horizontally scalable AWS solutions. It introduces a single (up to) 50 Gbps network hub that all inter-service traffic must go through. Native AWS architectures avoid such central hubs and provide a virtual routing fabric instead.

Slack could have chosen one of many other AWS design patterns such as VPC peering, transit VPC, IGW routing, or colocating more services in fewer VPCs (with more granular IAM role policies to separate operator privileges), to provide an automatically scaled network fabric to connect their services.

(This isn't to criticize Slack's engineering team. They have successfully scaled their service in a short time, and I'm happy with their product overall, and with their transparency in this report. But I think AWS has the world's biggest and most scalable network fabric - it's just a matter of knowing how to harness it.)

If your oldest request was queued 5+ seconds ago in a near-realtime system (such as Slack), CPU usage isn't your biggest problem.

Slack wrote an autoscaling implementation that ignored request queue depth and downsized their cluster based on CPU usage alone, so while they knew how to resolve it, I would not go so far as to say they knew how to prevent it. The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.
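
A rough sketch of what "don't ignore the maxage of the request queue" could look like in a scaling rule. This is not Slack's autoscaler; the `FleetMetrics` fields and every threshold below are made-up assumptions, chosen only to show queue age overriding a low CPU reading.

```python
from dataclasses import dataclass


@dataclass
class FleetMetrics:
    cpu_utilization: float       # 0.0 to 1.0, averaged across the fleet
    oldest_request_age_s: float  # age of the oldest queued request, seconds


def scaling_decision(m, cpu_low=0.25, cpu_high=0.70, max_queue_age_s=1.0):
    """Return "scale_up", "scale_down" or "hold" (thresholds are illustrative).

    Queue age is checked first: idle CPUs with an old request in the queue
    usually mean workers are blocked on something else (network, a
    dependency), so low CPU alone must never trigger a scale-down.
    """
    if m.oldest_request_age_s > max_queue_age_s:
        return "scale_up"
    if m.cpu_utilization > cpu_high:
        return "scale_up"
    if m.cpu_utilization < cpu_low:
        return "scale_down"
    return "hold"
```

With these assumed thresholds, `scaling_decision(FleetMetrics(0.10, 5.0))` asks to scale up even though CPU is nearly idle, which is the case a CPU-only rule gets wrong.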

> The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.

What's the first?

Non-randomized wallclock integers.

For example: “sleep 60 seconds”, “cron 0 * * * * command”, “X-Retry-After: 300”

Found in: recurring jobs, backoff algorithms, oauth tokens.

Found in: ops-created tasks, dev-released software.
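
For anyone wondering what the randomized alternative looks like, here is a small sketch using two common techniques: a fixed interval with a random spread, and exponential backoff with "full jitter". The function names, the 20% spread and the 300-second cap are assumptions picked for illustration.

```python
import random


def jittered_interval(base_s, spread=0.2, rng=random):
    """Return base_s shifted randomly by up to +/- spread (a fraction of base_s).

    Instead of every client sleeping exactly 60 seconds, each one sleeps
    somewhere between 48 and 72, so wakeups drift apart over time.
    """
    return base_s * (1.0 + rng.uniform(-spread, spread))


def full_jitter_backoff(attempt, base_s=1.0, cap_s=300.0, rng=random):
    """Exponential backoff with full jitter.

    The delay is drawn uniformly from 0 up to min(cap_s, base_s * 2**attempt),
    so clients that failed at the same instant do not retry at the same instant.
    """
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

Usage is just `time.sleep(jittered_interval(60))` in place of `time.sleep(60)`. Passing a seeded `random.Random` as `rng` keeps tests deterministic.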

I'm building something at Cronitor to help detect those hot-spots! If you want to learn more, email me: shane at cronitor.io
Tell us more here!
Well Slack depended on the Cloud(tm).

It is interesting, though, because a lot of the blog posts like "How we handled a 3000% traffic increase overnight!" boil down to "We turned up the AWS knob".

What happens when the AWS knob doesn't work?

You do what Slack did and call the maker of the AWS Knob:tm:.
Some of my co-workers came from active.com (a website that lets people register for marathons and events). The infrastructure had to handle massive spikes because registrations for big races would open all at once, so scalability was everything.

They explained to me that they'd intentionally slam the production website with external traffic a couple of times per year, at a scheduled time in the middle of the night. Like basically an order of magnitude greater than they'd ever received in real life, just to try to find the breaking point. The production website would usually go down for a bit, but this was vastly better than the website going down when actual real users are trying to sign up for the Boston Marathon.

Slack probably should've anticipated this surge in traffic after the holidays, and might have been able to run some better simulations and fire drills before it occurred.

The problem you run into is that while you can load test your website with no problems, when running on shared infrastructure (AWS), you have to account for everyone's website being under load at the same time. That isn't as easy to test or find bottlenecks for.
Very good test. The guys at iracing.com should have done this before organising the e-sports Daytona 24 hours race last week, it was by far their largest event (boosted by Covid lockdown). It crashed their central scheduling service with a database deadlock. Classic case of a bug you only find under heavy load.
> During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.

Sounds like AWS knew how to handle it too.

Given how AWS has responded to past events like this, I'd bet there's an internal post-mortem and they'll add mechanisms to fix this scaling bottleneck for everyone.

Although one thing I'm not clear on is if this was really an AWS issue or if Slack hit one of the documented limits of Transit Gateway (such as bandwidth), after which AWS started dropping packets. If that's the case then I don't see what AWS could have done here, other than perhaps have ways to monitor those limits, if they don't already. The details here are a bit fuzzy in the post.

If you hit yourself on the thumb while using a hammer, do you blame the hammer manufacturer or yourself? TGW limits are well documented.
I mean, when the hammer manufacturer sells managed, auto scaling thumb-avoiding services you might rely on that.

If I understand correctly they didn't initially hit a TGW quota, it just didn't scale up fast enough.

"Hey Boss, this system that our team selected and configured, behaved as documented but not in a way that protected our customers' experience.

It's Amazon's fault, not ours..."

If someone came to me with that, I'd educate them on how I saw it quite differently, politely but firmly.

Unless I'm misunderstanding something the system did not perform as documented. It should have scaled, it didn't.

When a critical piece of infrastructure fails under massive load I'm not sure it'll help much when you politely tell your engineers they fucked up for not anticipating it.

You learn lessons. Both Slack and AWS seem to have learnt lessons here.

I agree with much of what you say, but if you change it to "It's Amazon's fault, not ours", that's where I diverge.

Slack did fuck up here, as evidenced by the outage and you seem to at least partially agree by the fact that Slack learned a lesson. Further, I think that "understanding how your system scales up from a low baseline to a high level of utilization (such as Black Friday/Cyber Monday for e-commerce, or special event launches, or a SuperBowl ad landing page)" is a standard, "par for the course" cloud engineering topic to be on top of nowadays.

If you have a known increase in traffic at a certain date/time that auto-scaling (either EC2 or NLB/ALB or another service) can't handle fast enough, you can let AWS know through your support contract so they over-provision during that time; otherwise the scale-up will take too long.
I remember going to a presentation by someone from FanDuel where he discussed something similar. Their usage patterns (heavy spike on NFL Sundays) caused similar problems with infrastructure that expected more gradual build-up. They engineered for it with synthetic traffic in advance of their expected spike to ensure their infrastructure was warm.
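
A sketch of how a synthetic warm-up ramp might be planned. The FanDuel talk's actual method isn't described here, so the geometric ramp, the function name and the example numbers are all assumptions; the load generator that would consume the schedule is left out.

```python
def warmup_schedule(baseline_rps, peak_rps, steps):
    """Return target request rates growing geometrically from baseline to peak.

    Each step multiplies the previous rate by the same ratio, so managed
    components that scale on observed load see a steady build-up rather
    than one overnight jump. The last entry equals peak_rps.
    """
    if steps < 1 or baseline_rps <= 0 or peak_rps < baseline_rps:
        raise ValueError("need steps >= 1 and 0 < baseline_rps <= peak_rps")
    ratio = (peak_rps / baseline_rps) ** (1.0 / steps)
    return [baseline_rps * ratio ** i for i in range(1, steps + 1)]
```

For example, `warmup_schedule(100, 1600, 4)` doubles the synthetic rate at each of four steps until it reaches the expected peak. How long to hold each step depends on how fast the underlying service actually scales, which is exactly the thing you'd be testing.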

TL;DR it’s still your responsibility to understand the limitations of your infrastructure decisions and engineer your systems accordingly.

No, using AWS does not absolve you of that responsibility. The game of paying a vendor to be an engineer is that you have to have strategies to test this kind of stuff.

Slack didn’t know how to handle it, they paid AWS hoping the product did what it said on the tin. They didn’t test for this case and got bit.

They have millions of clients they could have coordinated to load test this stuff by picking some time to disable the cache and fall back to cache if it failed.

True, but with any managed service, hidden limits and a cloud provider's own engineering (or lack thereof) may come back to bite the top 0.1% (the whales).

One approach to solve problems of scale is to trim down scale and bound it across multiple disparate silos that do not interact with each other at all, under any circumstances, except maybe for making quick, constant-time, scale-independent decisions.

In short, do things that don't need scale.

AWS might have had back-to-work traffic in lots of domains simultaneously.

Or maybe their monitoring and response staff was just coming back online.

Lots of other services that use AWS didn't go down the same day -- because they provisioned enough AWS capacity.
Where's the button to provision more TGW capacity?
I actually thought something in AWS was a cause but did not know anything about how TGW works internally.
I actually thought something in AWS was a cause but did not know about these internal systems.
But what HN predicted was wrong https://news.ycombinator.com/item?id=25632346

"My bet is that this incident is caused by a big release after a post-holiday "code freeze". "

HN comments suggested a broad plethora of things. I’m not surprised some happened on the right cause.
That's very kind of you to remember :)
But Slack has been around for longer than a year, right? Shouldn't they have noticed this happening earlier?

I mean, considering Slack is mostly used as a workplace chat mechanism, they should have faced this kind of a scenario previously and had a solution for this by now.

Yeah but this year the number of people working from home that would connect to slack directly at the beginning of their work day must be much much larger than the other years
Another small but potentially relevant detail: Not many people vacationed, so more people would have returned to work at standard times. Many people travel over holidays for example (usually) but this time around in many places it wasn't even an option. Other people extend their holidays to relax more, but I don't know any people interested in staycations in their house. We've had enough of it.
But also smaller because far fewer people went on a long traveling vacation away from their laptops.