| Yes! Scaling problems are extremely nice to have. May you be so blessed that all of your problems are scaling problems! People spend so much time trying to preemptively solve scaling problems and proactively design around scale. I am being 100% serious when I say that this is an absolute, total mistake. Scaling problems are just the absolute best to have; you shouldn't run away screaming from them before they happen. Twitter had the fail whale for years, and you know what it didn't do? Tank the business.

In scaling problems:

• You have money flowing in, and confidence that the thing is working.
• The load on the system has been dynamically adjusted down for you as your users have a suboptimal experience, giving you smaller messes to clean up.
• Often there is an expensive short-term solution to remove the egg on your face if so desired. You have one Really Important client who is pissed, you give them their own private app with twice the hardware until you can get this resolved, and now they love you for life because they see that you did backflips for them.
• You can profile the system. Complex systems, under instrumentation, literally tell you where to look for what is wrong. “Why is that taking 400ms?!”
• You can run side-by-side tests: clone the input from the stressed system to the new system and verify that the load has decreased while the underlying model is the same.
• Very often you will find logical bugs that you needed to fix anyway; they just become much more apparent when the system has high contention. Two things were implicitly ordered by time and thus did not break unless requests were concurrent, so they were doomed to break eventually, but now they break almost immediately.
• Your corporate overlords immediately understand the business value in getting the situation resolved “correctly”, whereas convincing them of refactors usually takes time. “We got the system limping along for now but if we don’t fix the underlying issue soon then this will happen again.” “Well we are really happy with how explosively this product is going, I guess we can delay those other two apps while you fix up this one.”
• At the same time, that license is not a license to dither, as usually happens with such speech. You feel committed to investigating and refactoring and fixing, with that time, because the uncertainty of the system is a sort of abstract menace exerting deadline pressure. It is legitimately exhilarating to figure out and fix the actual problems, and it gives you a good feeling about your contribution.

This is why I would generally tell people to not start out a project with Kubernetes, for example. You are trying to up-front the cost of a scaling problem, but that scaling problem is gonna be really nice to have. Do it later! Get some egg on your face.

There is an abstract reason too, which I should mention. Lots of systems die not because they have a “revenue problem” (long term they will make enough money to cover their costs) but because they have a “cashflow problem” (short term they overcommit and run out of money). You fail to pay your employees on time one month, and by the next month you find yourself potentially needing to spend time and money hiring new employees to replace the ones you just terrified. So you fail to deliver a functionality on time and a new client chooses a competitor. Stuff like that. This can happen simply because, if you have N payments coming in, the uncertainty of those payments can grow like √N while you keep only a constant buffer of cash on hand, but it can also happen in more elaborate ways. Almost always you can ease the pressure by deferring payments until the latest possible date.

When I was planning my wedding, my wife wanted to pay for everything as soon as possible to save on our mental load; I resisted. And that was clutch, because there were times during the planning when we were saying “ok we got $100 left to cover food and gas this week before I get paid on Friday, what can we do with that?” where, if we had up-fronted those costs, presumably we would have unnecessarily eaten ramen or less for the three weeks before. Like, it is easy to say “here is what we can save per month, so this is a reasonable budget for the wedding that won’t kill us”, which solves the revenue problem. But cashflow problems still exist even when you know that you will be able to cover the cost eventually.

Solving a scaling problem in advance is up-fronting a cost that can easily be delayed, and if you don’t delay it then there is a very real chance that your project can be canned way before you run into that scaling problem. In particular, let me link @jackdied’s talk “Stop Writing Classes”, where this is a theme: https://youtu.be/o9pEzgHorH0 Something like “They subclassed a dict. Because they might need to add some functionality to it later. You know what? You can just do that later!”

Same with scaling problems. This won’t scale, but we’ll worry about that later. And then when we worry about that, we will know by measurement whether our shiny new cloud machines need to be RAM-optimized or CPU-optimized or whether they need to be cheap-as-possible-but-ten-times-as-many-of-them. Whereas if you up-front this cost you literally are making all of your scaling decisions based on zero data and hunches. |
Building things that don't scale is often taken to the extreme, where the product is nothing more than a few HTML pages strung together by self-rolled crypto and JavaScript. This might be good for idea validation (the first couple of days), but it shouldn't persist past that point.
Understanding when the inevitable problems pop up, and being able to solve them quickly enough not to lose your users, is, to me, not a nice-to-have problem. Not being able to do so is a showstopper that will turn egg-on-your-face problems into fatalities.
A few things I would recommend:
* Instrumentation - You don't need to go full ELK before you find product-market fit. Hell you can log everything, and with two users you can always read all the logs. But something is key.
* Up and down scripts - How long will it take you to set up a new copy of the product? Do you have to trawl bash logs and have a team meeting to find out all the configs that need to be set up?
* Load-testing - Again, nothing fancy. Just put curl on a bash loop and see where the system starts to hiccup. Once you have real users with complex data on real hardware you'll wish you profiled things, even slightly.
* Instrumentation - I say it again because the sheer number of systems I've found on the verge of silent failure and user frustration because the errors weren't being propagated, be it from client side to server or from the logs to an alert, is massive. With modern SPAs it's easy to think everything works when all your users see is a blank page.
* Please don't roll your own crypto. Unless you're working in a language that is 2 days old you really shouldn't need to or be doing this.
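To make the load-testing bullet concrete: here is a minimal sketch of the same "loop and watch for hiccups" idea, written in Python with only the standard library instead of curl in a bash loop. The function names (`timed_call`, `load_step`, `http_get`) and the URL in the usage note are made up for illustration, and the numbers it reports are rough; it is a smoke test, not a benchmark.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def timed_call(request_fn):
    """Run one request; return (seconds_taken, succeeded)."""
    start = time.perf_counter()
    try:
        request_fn()
        ok = True
    except Exception:
        # Any failure (timeout, refused connection, HTTP error) counts as an error.
        ok = False
    return time.perf_counter() - start, ok


def load_step(request_fn, concurrency, total_requests):
    """Fire total_requests calls from `concurrency` worker threads and summarize."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(request_fn), range(total_requests)))
    latencies = sorted(seconds for seconds, _ in results)
    return {
        "concurrency": concurrency,
        "median_s": median(latencies),
        "max_s": latencies[-1],
        "errors": sum(1 for _, ok in results if not ok),
    }


def http_get(url, timeout=5):
    """One GET request; urlopen raises on network errors and HTTP error statuses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
```

Usage would be something like `for n in (1, 2, 4, 8, 16): print(load_step(lambda: http_get("http://localhost:8000/"), n, n * 10))`, with the URL swapped for your own app. The step where the median latency or the error count jumps is where to start profiling.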
In short, if you're using HPAs and Kubernetes you've gone too far. But don't choose your instance size at random, be able to set up a new instance in an hour, and have at least 60% confidence that if there is an error you will know about it.
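For the "you will know about it" part, a minimal sketch of one way to stop errors from failing silently on the server side, assuming a Python backend. The `notify` hook and the `report_errors` decorator are hypothetical names; in a real app `notify` would post to email, a chat webhook, or whatever you actually read.

```python
import functools
import logging

logger = logging.getLogger("app")


def notify(message):
    """Hypothetical alert hook: swap in email, a chat webhook, a pager, etc."""
    logger.critical("ALERT: %s", message)


def report_errors(alert=notify):
    """Decorator: log the full traceback, send an alert, then re-raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                logger.exception("unhandled error in %s", func.__name__)
                alert(f"{func.__name__} failed: {exc!r}")
                raise  # never swallow the error silently
        return wrapper
    return decorator
```

Wrapping request handlers or background jobs with `@report_errors()` does nothing for client-side failures like the blank SPA page, which need their own reporting back to the server, but it covers the "from the logs to an alert" gap with a few lines.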