
by dplgk 1964 days ago

> On January 4th, one of our Transit Gateways became overloaded. The TGWs are managed by AWS and are intended to scale transparently to us. However, Slack’s annual traffic pattern is a little unusual: Traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!). On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight.

What's interesting is that when this happened, some HN comments suggested it was the return from holiday traffic that caused it. Others said, "nah, don't you think they know how to handle that by now?"

Turns out Occam's razor applied here. The simplest answer was the correct one. Return-from-holiday traffic.

5 comments

cett 1964 days ago

Though the nuance is Slack did know how to handle it, AWS didn't.

fipar 1964 days ago

I don't mean this ironically, but I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck.

This usually works well, under the rationale that "upstream provider does this for a living, so they must be better than us at this", but if you have too unique needs (or are just a bit "unlucky"), it can fail too.

All this to say that the cloud isn't magic. From a risk/error prevention point of view, it's not that different from writing software for a single local machine: not every programmer needs to know how to manually do memory management, it makes a lot more sense to rely on your OS and malloc (and friends) for this, but the caveat is that you do need to account for the fact that malloc may fail. In the cloud case, one can't just assume that you'll always be able to provision a new instance, scale up a service, etc. The cloud is like a utility company: normally very reliable, but they do fail too.
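To make the malloc analogy concrete, here is a minimal Python sketch of treating "provision a new instance" as a call that can fail. Everything in it is an assumption for illustration: `CapacityError`, the `provision` callable, and the region names are hypothetical and do not correspond to any real cloud SDK.

```python
class CapacityError(Exception):
    """Raised by the hypothetical provider when it cannot supply capacity."""


def provision_with_fallback(provision, regions, attempts_per_region=2):
    """Try each region in order; return (region, handle) for the first success.

    `provision` is a hypothetical callable, provision(region) -> handle,
    that raises CapacityError when the provider has nothing to give.
    """
    errors = []
    for region in regions:
        for attempt in range(attempts_per_region):
            try:
                return region, provision(region)
            except CapacityError as exc:
                errors.append((region, attempt, str(exc)))
    # Like a NULL return from malloc: the caller must decide what to do next.
    raise CapacityError(f"no capacity in any region: {errors}")
```

The point is only that the failure path exists in the code at all. What the caller does with the final error (shed load, degrade, page someone) is a separate decision.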

tw04 1964 days ago

>I don't mean this ironically, but I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck.

Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?

If you're saying now the story is: well, rely on them to auto scale, until they don't - then why would I bother? Now you're telling me I need to go back to having infrastructure experts, which means I can save a TON of money by going with a hosting provider that allows allocation of resources via API (which is basically all of them).

solidasparagus 1964 days ago

No, the cloud provides scalable infrastructure, but once you are in the 0.01% and you have very unique usage patterns, you still need to know how to set up your infrastructure for your needs. The difference is that instead of writing and managing a scalable cache, you just need to build the layer that knows to pre-provision for that scale/talk with AWS to make sure the system has sufficient capacity.

The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.

thu2111 1964 days ago

This feels like AWS apologism.

Slack knew how to set up their infrastructure. Nothing in the postmortem implies AWS was misconfigured. AWS spotted the problem and fixed it entirely on their side.

Nothing in this report suggests that Slack has unique usage patterns. Users returning to work after Christmas is not a phenomenon unique to Slack.

Their problems were:

1. The AWS infrastructure broke due to an event as predictable as the start of the year. That's on Amazon.

2. Their infrastructure is too complicated. Their auto-scaling created chaos by shutting down machines whilst engineers were logged into them due to bad heuristics, although it's not like this was a good way to save money, and their separation of Slack into many different AWS accounts created weird bottlenecks they had no way to understand or fix.

3. They were unable to diagnose the root cause and the outage ended when AWS noticed the problem and fixed their gateway system themselves.

> The cloud isn't some magic thing that solves all scaling problems

In this case it actually created scaling problems where none needed to exist. AWS is expensive compared to dedicated machines in a colo. Part of the justification for that high cost is seamless scalability and ability to 'flex'.

But Slack doesn't need the ability to flex here. Scaling down over the holidays and then back up once people returned to work just isn't that important for them - it's unlikely there were a large number of jobs queued up waiting to run on their spare hardware for a few days anyway. It just wasn't a good way to save money: a massive outage certainly cost them far more than they'll ever save.

gowld 1964 days ago

It wasn't scaling "back up". It was a huge spike as everyone refilled cache at the same time.

It's similar to Black Friday spikes Amazon handles themselves.

paranoidrobot 1964 days ago

> The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.

I don't think anyone who's got any reasonable level of experience is expecting that it's a magic wand.

There are, though, some things in AWS (and for sure other cloud providers) where you get no useful signals or controls. It's entirely managed by the cloud provider, based on their own internal metrics and scaling behaviors.

Behind the scenes, their load balancer services don't give you indications of how heavily loaded they are - nor do you get to directly control how many/big those load balancers are.

In some parts you can hack around this by pre-warming infrastructure by generating fake traffic - but that assumes that you have those metrics and knowledge that you even need to do this.

This applies to all sorts of things - there are hidden caps and other capacity limits all over AWS's platform that you don't know about until you hit them. There are even capacity limits that you can know about, because they're publicly documented, but AWS lies and won't tell you the actual limit being applied to your account - the console and documentation say one thing, but in reality it's a lot lower.

If that capacity limit resulted in an outage, well, tough luck.

twblalock 1964 days ago

If you are serious about reliability you always need infrastructure experts.

AWS is pretty good about documenting the limits of their systems, SLAs, how to configure them, etc. They don't just say you should wave a magic wand -- and even if they did say that, professional software engineers know better.

"a hosting provider that allows allocation of resources via API" is exactly what AWS is. Your infrastructure experts come into the picture because they need to know which resources to request, how to estimate the scale they need, and how to configure them properly. They should also be doing performance testing to see if the claimed performance really holds up.

remus 1964 days ago

> Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?

Clearly there are limits even with the largest cloud providers. You'll have to engage a bit of critical thought into whether you're going to get near those limits and what that might mean for your product. Obviously that's easier said than done, but you could argue that the cloud providers are still giving you reasonable value if you can pass the buck on a given issue for x years.

JMTQp8lwXL 1964 days ago

You have to know how to write code that fits into the cloud. You can't arbitrarily read/write to the file system, acting as if there's only one instance of the server running (if you plan to run hundreds or thousands). So even by waving the cloud 'magic wand', you still need to understand writing code in a cloud-friendly way. So in some sense, it's a shared responsibility between the vendor and engineering. You need to understand how to apply the tools being given to you.

tw04 1964 days ago

Per the article, literally nothing in their code would have solved the issue. AWS was supposed to auto-scale TGWs and didn't.

>Our own serving systems scale quickly to meet these kinds of peaks in demand (and have always done so successfully after the holidays in previous years). However, our TGWs did not scale fast enough. During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.

JMTQp8lwXL 1964 days ago

Correct, I was disputing the point that you can freely code without being mindful of the architecture even though the selling point of cloud providers is "focus on code, leave architecture to us". I'm not disputing in this case AWS was at fault: as the customer, Slack did everything right.

ncallaway 1964 days ago

> Isn't that literally supposed to be the sales pitch for the cloud?

Yes.

But a sales pitch is the most positive framing of the product possible. I wouldn't rely on the sales pitch when making the decision about how much you should depend on the cloud.

SilasX 1964 days ago

>This usually works well, under the rationale that "upstream provider does this for a living, so they must be better than us at this", but if you have too unique needs (or are just a bit "unlucky"), it can fail too.

Heh, a while ago I joked that one way to scale is to "make it somebody else's problem", with the proviso that you need to make sure that the someone else can handle the load. And then (due to the context) a commenter balked at the idea that a big player like YouTube would be unable to handle the scaling of their core business.

https://news.ycombinator.com/item?id=23170685

(If they're really blaming it on AWS, it really takes guts to do it so publicly, I think.)

Johnny555 1964 days ago

> I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck

The issue was a transit gateway, a core network component. If they weren't in the cloud, this would have been a router, so they "outsourced" it in the same way an on-prem service outsources routing to Cisco. I guess the difference is they might have had better visibility into the Cisco router and known it was overloaded.

ak217 1964 days ago

I don't think that's true. Slack seems to have their core online services split across a number of VPCs, and for some reason decided to use Transit Gateway to connect them. Transit Gateway is a special-purpose solution that is geared toward cross-region and on-prem to VPC connections in corporate networks, not to global high-traffic consumer products. It's the wrong tool for the job. Its architecture is antithetical to the other horizontally scalable AWS solutions. It introduces a single (up to) 50 gbps network hub that all inter-service traffic must go through. Native AWS architectures avoid such central hubs and provide a virtual routing fabric instead.

Slack could have chosen one of many other AWS design patterns such as VPC peering, transit VPC, IGW routing, or colocating more services in fewer VPCs (with more granular IAM role policies to separate operator privileges), to provide an automatically scaled network fabric to connect their services.

(This isn't to criticize Slack's engineering team. They have successfully scaled their service in a short time, and I'm happy with their product overall, and with their transparency in this report. But I think AWS has the world's biggest and most scalable network fabric - it's just a matter of knowing how to harness it.)

floatingatoll 1964 days ago

If your oldest request was queued 5+ seconds ago in a near-realtime system (such as Slack), CPU usage isn't your biggest problem.

Slack wrote an autoscaling implementation that ignored request queue depth and downsized their cluster based on CPU usage alone, so while they knew how to resolve it, I would not go so far as to say they knew how to prevent it. The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.
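A rough sketch of what "don't ignore the max age of the request queue" could look like in a scaling decision. This is not Slack's autoscaler; the `Metrics` fields, the function name, and all thresholds are made-up examples chosen only to show queue age taking priority over CPU.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    # Fraction of CPU in use, from 0.0 to 1.0.
    cpu_utilization: float
    # Seconds the oldest queued request has been waiting.
    oldest_request_age_s: float


def scaling_decision(m, cpu_low=0.25, cpu_high=0.75, max_queue_age_s=1.0):
    """Return "up", "down", or "hold" (all thresholds are illustrative)."""
    # An old queue means requests are stuck waiting, whatever the CPU says.
    if m.oldest_request_age_s > max_queue_age_s:
        return "up"
    if m.cpu_utilization > cpu_high:
        return "up"
    if m.cpu_utilization < cpu_low:
        return "down"
    return "hold"
```

With CPU alone, `Metrics(cpu_utilization=0.05, oldest_request_age_s=5.0)` looks like an idle fleet that should shrink. With the queue check first, the same reading asks for more capacity.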

nicoburns 1964 days ago

> The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.

What's the first?

floatingatoll 1964 days ago

Non-randomized wallclock integers.

For example: “sleep 60 seconds”, “cron 0 * * * * command”, “X-Retry-After: 300”

Found in: recurring jobs, backoff algorithms, oauth tokens.

Found in: ops-created tasks, dev-released software.
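For illustration, a minimal sketch of the usual fix: randomize the integer. The function names, the default spread, and the cap are my own assumptions, not something from a specific library. The first helper spreads a fixed interval, the second is exponential backoff with "full jitter" (wait a uniform random time up to the exponential ceiling).

```python
import random


def jittered_delay(base_seconds, spread=0.5, rng=random):
    """Return a delay drawn uniformly from base*(1-spread) to base*(1+spread)."""
    low = base_seconds * (1.0 - spread)
    high = base_seconds * (1.0 + spread)
    return rng.uniform(low, high)


def backoff_with_full_jitter(attempt, base_seconds=1.0, cap_seconds=300.0, rng=random):
    """Exponential backoff where the actual wait is uniform in [0, ceiling]."""
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

So instead of "sleep 60 seconds" every client sleeps `jittered_delay(60)`, somewhere between 30 and 90 seconds, and a fleet that restarted together stops hitting the server in lockstep.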

encoderer 1964 days ago

I'm building something at Cronitor to help detect those hot-spots! If you want to learn more, email me: shane at cronitor.io

floatingatoll 1964 days ago

Tell us more here!

tempest_ 1964 days ago

Well Slack depended on the Cloud(tm).

It is interesting, though, because a lot of the blog posts like "How we handled a 3000% traffic increase overnight!" boil down to "We turned up the AWS knob".

What happens when the AWS knob doesn't work?

buildawesome 1964 days ago

You do what Slack did and call the maker of the AWS Knob:tm:.

rrrrrrrrrrrryan 1964 days ago

Some of my co-workers came from active.com (a website that lets people register for marathons and events). The infrastructure had to handle massive spikes because registrations for big races would open all at once, so scalability was everything.

They explained to me that they'd intentionally slam the production website with external traffic a couple of times per year, at a scheduled time in the middle of the night. Like basically an order of magnitude greater than they'd ever received in real life, just to try to find the breaking point. The production website would usually go down for a bit, but this was vastly better than the website going down when actual real users are trying to sign up for the Boston Marathon.

Slack probably should've anticipated this surge in traffic after the holidays, and might have been able to run some better simulations and fire drills before it occurred.

bryan_w 1964 days ago

The problem you run into is that while you can load test your website with no problems, when running on shared infrastructure (AWS), you have to account for everyone's website being under load at the same time. That isn't as easy to test or find bottlenecks for.

t0mas88 1964 days ago

Very good test. The guys at iracing.com should have done this before organising the e-sports Daytona 24 hours race last week, it was by far their largest event (boosted by Covid lockdown). It crashed their central scheduling service with a database deadlock. Classic case of a bug you only find under heavy load.

cle 1964 days ago

> During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.

Sounds like AWS knew how to handle it too.

Given how AWS has responded to past events like this, I'd bet there's an internal post-mortem and they'll add mechanisms to fix this scaling bottleneck for everyone.

Although one thing I'm not clear on is if this was really an AWS issue or if Slack hit one of the documented limits of Transit Gateway (such as bandwidth), after which AWS started dropping packets. If that's the case then I don't see what AWS could have done here, other than perhaps have ways to monitor those limits, if they don't already. The details here are a bit fuzzy in the post.

oxfordmale 1964 days ago

If you hit yourself on the thumb while using a hammer, do you blame the hammer manufacturer or yourself? TGW limits are well documented.

0dmethz 1964 days ago

I mean, when the hammer manufacturer sells managed, auto scaling thumb-avoiding services you might rely on that.

If I understand correctly they didn't initially hit a TGW quota, it just didn't scale up fast enough.

sokoloff 1964 days ago

"Hey Boss, this system that our team selected and configured, behaved as documented but not in a way that protected our customers' experience.

It's Amazon's fault, not ours..."

If someone came to me with that, I'd educate them on how I saw it quite differently, politely but firmly.

0dmethz 1964 days ago

Unless I'm misunderstanding something the system did not perform as documented. It should have scaled, it didn't.

When a critical piece of infrastructure fails under massive load I'm not sure it'll help much when you politely tell your engineers they fucked up for not anticipating it.

You learn lessons. Both Slack and AWS seem to have learnt lessons here.

sokoloff 1964 days ago

I agree with much of what you say, but if you change it to "It's Amazon's fault, not ours", that's where I diverge.

Slack did fuck up here, as evidenced by the outage and you seem to at least partially agree by the fact that Slack learned a lesson. Further, I think that "understanding how your system scales up from a low baseline to a high level of utilization (such as Black Friday/Cyber Monday for e-commerce, or special event launches, or a SuperBowl ad landing page)" is a standard, "par for the course" cloud engineering topic to be on top of nowadays.

jabart 1964 days ago

If you have a known increase in traffic at a certain date/time that auto-scaling (either EC2 or NLB/ALB or another service) can't handle fast enough you can let AWS know through your support contract to over-provision during that time or the scale up will take too long.

curun1r 1964 days ago

I remember going to a presentation by someone from FanDuel where he discussed something similar. Their usage patterns (heavy spike on NFL Sundays) caused similar problems with infrastructure that expected more gradual build-up. They engineered for it with synthetic traffic in advance of their expected spike to ensure their infrastructure was warm.

TL;DR it’s still your responsibility to understand the limitations of your infrastructure decisions and engineer your systems accordingly.
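As a sketch of the "warm it up with synthetic traffic" idea, here is one way to plan a ramp ahead of a known spike. This is not what FanDuel actually ran; the function name, the geometric ramp, and the example numbers are assumptions for illustration, and actually sending the traffic is left out.

```python
def warmup_schedule(target_rps, steps, start_rps=1.0):
    """Return a list of request rates that grows geometrically to target_rps."""
    if steps < 2:
        return [float(target_rps)]
    # Constant multiplier between consecutive steps.
    ratio = (target_rps / start_rps) ** (1.0 / (steps - 1))
    rates = [start_rps * ratio ** i for i in range(steps)]
    rates[-1] = float(target_rps)  # pin the last step to avoid float drift
    return rates
```

For example, `warmup_schedule(1000, 4)` ramps through roughly 1, 10, 100 and 1000 requests per second, so infrastructure that scales on gradual build-up sees one before the real spike arrives.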

kortilla 1964 days ago

No, using AWS does not absolve you of that responsibility. The game of paying a vendor to be an engineer is that you have to have strategies to test this kind of stuff.

Slack didn’t know how to handle it, they paid AWS hoping the product did what it said on the tin. They didn’t test for this case and got bit.

They have millions of clients they could have coordinated to load test this stuff by picking some time to disable the cache and fall back to cache if it failed.

ignoramous 1964 days ago

True, but with any managed service, hidden limits and a cloud provider's own engineering (or lack thereof) may come back to bite the top 0.1% (the whales).

One approach to solve problems of scale is to trim down scale and bound it across multiple disparate silos that do not interact with each other at all, under any circumstances, except maybe for making quick, constant-time, scale-independent decisions.

In short, do things that don't need scale.

m463 1964 days ago

AWS might have had back-to-work traffic in lots of domains simultaneously.

Or maybe their monitoring and response staff was just coming back online.

twblalock 1964 days ago

Lots of other services that use AWS didn't go down the same day -- because they provisioned enough AWS capacity.

helper 1964 days ago

Where's the button to provision more TGW capacity?

coldcode 1964 days ago

I actually thought something in AWS was a cause but did not know anything about how TGW works internally.

coldcode 1964 days ago

I actually thought something in AWS was a cause but did not know about these internal systems.

polote 1964 days ago

But what HN predicted was wrong https://news.ycombinator.com/item?id=25632346

"My bet is that this incident is caused by a big release after a post-holiday "code freeze". "

hnlmorg 1964 days ago

HN comments suggested a broad plethora of things. I’m not surprised some happened on the right cause.

floatingatoll 1964 days ago

That's very kind of you to remember :)

thunderbong 1964 days ago

But Slack has been around for longer than a year, right? Shouldn't they have noticed this happening earlier?

I mean, considering Slack is mostly used as a workplace chat mechanism, they should have faced this kind of a scenario previously and had a solution for this by now.

delfaras 1964 days ago

Yeah but this year the number of people working from home that would connect to slack directly at the beginning of their work day must be much much larger than the other years

steve_adams_86 1964 days ago

Another small but potentially relevant detail: Not many people vacationed, so more people would have returned to work at standard times. Many people travel over holidays for example (usually) but this time around in many places it wasn't even an option. Other people extend their holidays to relax more, but I don't know any people interested in staycations in their house. We've had enough of it.

gowld 1964 days ago

But also smaller because much fewer people went on a long traveling vacation away from laptop.