Hacker News
by 0dmethz 1962 days ago
I mean, when the hammer manufacturer sells managed, auto-scaling, thumb-avoiding services, you might rely on that.

If I understand correctly, they didn't initially hit a TGW quota; it just didn't scale up fast enough.

"Hey Boss, this system that our team selected and configured behaved as documented, but not in a way that protected our customers' experience.

It's Amazon's fault, not ours..."

If someone came to me with that, I'd educate them on how I saw it quite differently, politely but firmly.

Unless I'm misunderstanding something, the system did not perform as documented. It should have scaled, and it didn't.

When a critical piece of infrastructure fails under massive load, I'm not sure it'll help much when you politely tell your engineers they fucked up for not anticipating it.

You learn lessons. Both Slack and AWS seem to have learnt lessons here.

I agree with much of what you say, but if you change it to "It's Amazon's fault, not ours", that's where I diverge.

Slack did fuck up here, as evidenced by the outage, and you seem to at least partially agree, given that you say Slack learned a lesson. Further, I think that "understanding how your system scales up from a low baseline to a high level of utilization (such as Black Friday/Cyber Monday for e-commerce, or special event launches, or a Super Bowl ad landing page)" is a standard, "par for the course" cloud engineering topic to be on top of nowadays.