Hacker News
by rrrrrrrrrrrryan 1964 days ago
Some of my co-workers came from active.com (a website that lets people register for marathons and events). The infrastructure had to handle massive spikes because registrations for big races would open all at once, so scalability was everything.

They explained to me that they'd intentionally slam the production website with external traffic a couple of times per year, at a scheduled time in the middle of the night. Basically an order of magnitude more than they'd ever received in real life, just to try to find the breaking point. The production website would usually go down for a bit, but this was vastly better than the website going down when actual users are trying to sign up for the Boston Marathon.

Slack probably should've anticipated this surge in traffic after the holidays, and it might have been able to run some better simulations and fire drills before it occurred.

2 comments

The problem you run into is that while you can load test your website with no problems, when running on shared infrastructure (AWS), you have to account for everyone's website being under load at the same time. That isn't as easy to test or find bottlenecks for.
Very good test. The guys at iracing.com should have done this before organising the e-sports Daytona 24 hours race last week; it was by far their largest event (boosted by Covid lockdown). It crashed their central scheduling service with a database deadlock. Classic case of a bug you only find under heavy load.