
Maybe your application/environment was stable enough that site performance didn't change drastically over a decade. I've seen performance shift dramatically with the same number of users for different reasons (high-post flattened comment threads being slammed when Michael Jackson died, software upgrades which mysteriously load up servers on some requests, and of course new features). When a site issue occurs, I don't reach for "how many users are logged in?"; I reach for my sorted list of machine- and service-specific metrics and look for irregularities.

That being said, trends in user visits are of course great numbers for capacity planning because they give you an idea of how much growth to expect in the near future. But they're only a vague multiplier; you need to know how beefy a box to get (by stress testing to determine capacity) and then multiply by the growth factor. In practice it's usually more complicated than this.
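As a rough sketch of the simple version of that arithmetic (before the complications), something like the following would do. The function name, the parameters, and the 25% headroom default are all my own assumptions for illustration, not numbers from any real site.

```python
import math

def servers_needed(current_peak_users, growth_factor, users_per_server, headroom=0.25):
    """Back-of-the-envelope box count for a projected user load.

    current_peak_users: peak concurrent users seen today
    growth_factor:      expected multiplier from visit trends (1.5 = +50%)
    users_per_server:   capacity of one box, as measured by stress testing
    headroom:           fraction of each box kept in reserve (assumed default)
    """
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")

    projected_users = current_peak_users * growth_factor
    usable_per_server = users_per_server * (1 - headroom)
    # Round up: you can't buy a fraction of a server.
    return math.ceil(projected_users / usable_per_server)

if __name__ == "__main__":
    # Hypothetical numbers: 10,000 peak users, 50% expected growth,
    # and a stress test that showed 2,000 users per box.
    print(servers_needed(10_000, 1.5, 2_000))
```

This only covers the "vague multiplier" part. It assumes load scales linearly with users, which is exactly the assumption that tends to break.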

Stress testing doesn't have to be a formal process in all environments. You might just have a developer with a new chat server who wants a benchmark of how many users can join and chat before CPU peaks. An hour or two of coding should provide a workable test on similar hardware, which can then be generalized with tests of other software to give an idea of the capacity when a certain number of users are logged in and performing the same operations. The point isn't to know with 100% certainty when you will fall over, but to have at least an idea of when you're going to fall over, so you don't have to actually fall over to figure out when and where to scale.
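The "hour or two of coding" kind of test could look roughly like this ramp loop. It's only a sketch: `measure_cpu` is a hypothetical callable you'd write yourself (connect N simulated users to the test box, let them chat, return the CPU utilization you observed as a fraction), and the 80% limit and step sizes are assumed values.

```python
def find_capacity(measure_cpu, cpu_limit=0.8, start=100, step=100, max_users=10_000):
    """Ramp up simulated users until measured CPU goes over cpu_limit.

    measure_cpu: hypothetical callable taking a user count and returning
                 CPU utilization as a fraction (0.0 to 1.0).
    Returns the highest tested user count that stayed at or under the
    limit, or None if even the first step went over it.
    """
    last_ok = None
    for users in range(start, max_users + 1, step):
        if measure_cpu(users) > cpu_limit:
            break  # stop at the first step that crosses the limit
        last_ok = users
    return last_ok

if __name__ == "__main__":
    # Stand-in for a real measurement: pretend CPU grows linearly and
    # saturates at 2,000 users. A real run would drive an actual server.
    fake_measure = lambda users: users / 2000
    print(find_capacity(fake_measure, cpu_limit=0.75))
```

The result is only as precise as the step size, which is fine here: the goal is a ballpark for when you'd fall over, not an exact number.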

I have no problem with very short-term, big-lab stress testing. We had the same issue at my last place, and with lots of caution, it worked fine. But Jesus Christ, if I told my bosses "I think we should run all the servers with extra load until they fall over, then re-evaluate", they'd look at me like I had antlers growing out of my head.