Hacker News
by rigel_kentaurus 2145 days ago
This can also be summarized as "Early on, don't be afraid of doing things that don't scale".

In this particular example, my "reset password" functionality could be sending me an email, and I have to reset the password manually and email you a temporary one that you can use. Bad solution? Yes. Doesn't scale? Of course. But if you have 5 customers it's not a big deal and you can use your time on something else.

The key here is to have good people in the team. A decent Senior Engineer / Architect would know that 'Reset password' will be needed. That is why a very technically solid person is gold in the first round, because they can design the signup in such a way that it's prepared for future requirements. They of course would also know that you can't store passwords in plain text, that they should be hashed, that you might have to use a vault, that you need to randomize the hash, and that in the future you should limit access to the "user" table and never ever expose it through an endpoint where a user can reach it, as well as protecting against XSS and SQL injections.
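To make "hashed" and "randomize the hash" concrete (the randomizing is usually called salting), here is a minimal sketch using only the Python standard library. The function names, the 16-byte salt, and the iteration count are my assumptions for illustration, not something from the comment above, and a maintained auth library or managed service may well be the better choice in practice.

```python
import hashlib
import hmac
import os

# Assumed work factor for illustration; tune it for your own hardware.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) to store; the plain-text password is never stored."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return hmac.compare_digest(candidate, digest)
```

Because each password gets its own salt, two users with the same password end up with different stored digests.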

Which items from the list above can be dropped? There is a balance in getting the right solution without scope creep, while also not implementing something so silly that it's going to get you into trouble later. The genius of a competent person is being able to tell what needs to be solved now vs. later.

4 comments

Yes! Scaling problems are extremely nice to have. May you be so blessed that all of your problems are scaling problems!

People spend so much time trying to preemptively solve scaling problems and proactively design around scale. I am being 100% serious when I say that this is an absolute total mistake. Scaling problems are just the absolute best to have, you shouldn't run away screaming from them before they happen. Twitter had the fail whale for years, you know what it didn't do? Tank the business. In scaling problems:

• You have money flowing in, and confidence that the thing is working.

• The load on the system has been dynamically adjusted down for you as your users have a suboptimal experience, giving you smaller messes to clean up.

• Often there is an expensive short-term solution to remove the egg on your face if so desired. You have one Really Important client who is pissed, you give them their own private app with twice the hardware until you can get this resolved, and now they love you for life because they see that you did backflips for them.

• You can profile the system. Complex systems, under instrumentation, literally tell you where to look for what is wrong. “Why is that taking 400ms?!”

• You can run side-by-side tests, clone the input from the stressed system to the new system and verify that the load has decreased while the underlying model is the same.

• Very often you will find logical bugs that you needed to fix anyway, they just become much more apparent when the system has high contention. Two things were implicitly ordered by time and thus did not break unless requests were concurrent, so they were doomed to break eventually, but now they break almost immediately.

• Your corporate overlords immediately understand the business value in getting the situation resolved “correctly” whereas convincing them of refactors usually takes time. “We got the system limping along for now but if we don’t fix the underlying issue soon then this will happen again.” “Well we are really happy with how explosively this product is going, I guess we can delay those other two apps while you fix up this one.”

• At the same time that license is not a license to dither, as usually happens with such speech. You feel committed to investigating and refactoring and fixing, with that time, because the uncertainty of the system is a sort of abstract menace exerting deadline pressure. It is legitimately exhilarating to figure out and fix the actual problems, and gives you a good feeling about your contribution.
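As a rough illustration of the "Why is that taking 400ms?!" point in the list above, this is a minimal sketch of the kind of instrumentation that lets a system tell you where to look. The decorator name `timed`, the threshold values, and the `slow_lookup` function are all hypothetical.

```python
import functools
import logging
import time

logger = logging.getLogger("timing")


def timed(threshold_ms: float = 100.0):
    """Log a warning whenever the wrapped function takes longer than threshold_ms."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms >= threshold_ms:
                    logger.warning("%s took %.1f ms", func.__name__, elapsed_ms)
        return wrapper
    return decorator


@timed(threshold_ms=10)
def slow_lookup():
    # Stand-in for a slow query or remote call.
    time.sleep(0.05)
    return "done"
```

Something this small is obviously not a profiler, but it is cheap to add once the scaling problem actually shows up.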

This is why I would generally tell people to not start out a project with Kubernetes, for example. You are trying to up-front the cost of a scaling problem but that scaling problem is gonna be really nice to have. Do it later! Get some egg on your face.

There is an abstract reason too, which I should mention. Lots of systems die not because they have a “revenue problem” (long term they will make enough money to cover their costs) but because they have a “cashflow problem” (short term they overcommit and run out of money). You fail to pay your employees on time one month, and by the next month you find yourself potentially needing to spend time and money hiring new employees to replace the ones you just terrified. So you fail to deliver a feature on time and a new client chooses a competitor. Stuff like that. This can happen simply because, if you have N payments coming in, the uncertainty of those payments can grow like √N while you keep only a constant buffer of cash on hand, but it can also happen in more elaborate ways. Almost always you can ease the pressure by deferring payments until the latest possible date.
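The √N claim can be checked with a small simulation. This is only a sketch under assumptions that are mine, not the comment's: payments are independent, and each is drawn from a normal distribution with a made-up mean of 100 and standard deviation of 20.

```python
import random
import statistics


def total_income_spread(n_payments: int, trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of total income across many simulated months."""
    rng = random.Random(seed)
    totals = [
        # Each payment is independent: mean 100, standard deviation 20 (assumed).
        sum(rng.gauss(100.0, 20.0) for _ in range(n_payments))
        for _ in range(trials)
    ]
    return statistics.stdev(totals)


spread_25 = total_income_spread(25)
spread_100 = total_income_spread(100)
# Four times as many payments should give roughly twice the spread (sqrt(4) = 2).
ratio = spread_100 / spread_25
```

Under these assumptions the spread of the total roughly doubles when the number of payments quadruples, so a cash buffer that was comfortable at 25 payments covers proportionally fewer bad months at 100.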

When I was planning my wedding, my wife wanted to pay everything as soon as possible to save on our mental load; I resisted. And that was clutch because there were times during this when we were saying “ok we got $100 left to cover food and gas this week before I get paid on Friday, what can we do with that?” where if we had up-fronted those costs, presumably we would have unnecessarily eaten ramen or less for the three weeks before. Like it is easy to say “here is what we can save per month, so this is a reasonable budget for the wedding that won’t kill us”—solving the revenue problem. But cashflow problems still exist even when you know that you will be able to cover the cost eventually. Solving a scaling problem is up-fronting a cost that can easily be delayed and if you don’t delay it then there is a very real chance that your project can be canned way before you run into that scaling problem.

Especially let me link @jackdied’s talk “Stop Writing Classes” where this is a theme,

https://youtu.be/o9pEzgHorH0

Something something like “They subclassed a dict. Because they might need to add some functionality to it later. You know what? You can just do that later!” Same with scaling problems. This won’t scale, but we’ll worry about that later. And then when we worry about that, we will know by measurement whether our shiny new cloud machines need to be RAM-optimized or CPU-optimized or whether they need to be cheap-as-possible-but-ten-times-as-many-of-them. Whereas if you up-front this cost you literally are making all of your scaling decisions based on zero data and hunches.

Agreed about scaling problems, but the examples you provided illustrate that, in my experience, there are still things you should be doing from the get-go.

Building things that don't scale is often taken to the extreme where it's nothing more than a few HTML pages strung together by self-rolled crypto and javascript. This might be good for idea validation (first couple of days), but it shouldn't persist past this point.

Understanding when the inevitable problems pop up and being able to solve them in reasonable enough time not to lose your users is - to me - not a nice to have problem. It's a showstopper that will turn egg on your face problems into fatalities.

Few things I would recommend are:

* Instrumentation - You don't need to go full ELK before you find product-market fit. Hell you can log everything, and with two users you can always read all the logs. But something is key.
* Up and down scripts - How long will it take you to set up a new copy of the product? Do you have to trawl bash logs and have a team meeting to find out all the configs that need to be set up?
* Load-testing - Again, nothing fancy. Just put curl on a bash loop and see where the system starts to hiccup. Once you have real users with complex data on real hardware you'll wish you profiled things, even slightly.
* Instrumentation - I say it again because the sheer number of systems I've found on the verge of silent failure and user frustration because the errors weren't being propagated, be it from client side to server or from the logs to an alert, is massive. With modern SPAs it's easy to think everything works when all your users see is a blank page.
* Please don't roll your own crypto. Unless you're working in a language that is 2 days old you really shouldn't need to or be doing this.
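For the load-testing item, the "curl on a bash loop" idea can be sketched in Python as well. This is a deliberately naive, sequential loop; the `load_test` name and the shape of the report are my own assumptions, and `call` stands in for whatever request you want to hammer.

```python
import statistics
import time
from typing import Callable


def load_test(call: Callable[[], object], requests: int = 100) -> dict:
    """Invoke `call` repeatedly (requests >= 1) and summarize latency and errors."""
    latencies_ms = []
    errors = 0
    for _ in range(requests):
        start = time.perf_counter()
        try:
            call()
        except Exception:
            # Count failures instead of stopping; silent errors are the thing to find.
            errors += 1
        latencies_ms.append((time.perf_counter() - start) * 1000)
    latencies_ms.sort()
    return {
        "requests": requests,
        "errors": errors,
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[int(0.95 * (len(latencies_ms) - 1))],
        "max_ms": latencies_ms[-1],
    }
```

Against a real service you might pass something like `lambda: urllib.request.urlopen("http://localhost:8000/health").read()`, where the URL is hypothetical, and raise `requests` until the numbers start to hiccup.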

In short, if you're using HPAs and Kubernetes you've gone too far. But don't choose your instance size at random, be able to set up a new instance in an hour, and have at least 60% confidence that if there is an error you will know about it.

A decent eng will know that there are managed solutions for this problem that you don't have to build yourself.

In 2020, with the number of managed solutions out there for simple CRUD functionality, the default should be just using the managed thing until it doesn't suit you, not just ignoring critical yet non-unique pieces of infrastructure altogether.

Reality is nowhere near this rosy. Sure, maybe you can get away with canned auth, but I took that as illustrative.

A more practical example: Start/stop/change subscription plans. Until 12 days ago, Stripe had no customer portal so you had to implement this yourself. You can do it with support email, and "easy to switch plans" is not on any prospective customer's checklist.

"Too many change plan emails" is a good problem to have - deprioritize the feature. And if you're lucky, by the time you really need it, your billing service will have implemented the feature for you.

Not necessarily. If you have 0 customers but your MVP already depends on a bunch of external services, you have hard lock-ins or running costs from the beginning. Scaling and putting everything on scalable infra should be reserved for projects that make it past a couple of paying customers (= validation).

Don't be penny-wise and pound-foolish, though. Let's pretend I take a month to design, pick, configure, operationalize, and integrate with the project a pub/sub bus that will cost $0 while it's not running, by building it on VMs on hardware I own. A hypothetical alternative costs $X/month for the first Y billion messages. Even if I'll never exceed 1,000 messages/month, a hard rule to avoid running costs would argue for taking a month to build such a system.

A more pragmatic approach involves examining X and Y before deciding not to pay any money. (They're also different for everybody, and it also depends on your financial situation.) If X is less than the cost of one coffee from a local coffee shop, I personally would pay for it rather than spend a month building my own pub/sub bus that costs $0 per month while unused.

(Time it takes to implement is a 3rd variable, Z, though I'll note that estimating software projects is notoriously difficult to get right.)
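The X and Z trade-off can be put into back-of-the-envelope arithmetic. The function below is a sketch; the labor cost figure is entirely made up, and it assumes usage stays under the Y-message tier so the managed fee is just X per month.

```python
def months_to_break_even(
    build_months: float,
    monthly_labor_cost: float,
    managed_fee_per_month: float,
) -> float:
    """Months of managed-service fees (X) that equal the one-off cost of building (Z)."""
    if managed_fee_per_month <= 0:
        raise ValueError("managed_fee_per_month must be positive")
    build_cost = build_months * monthly_labor_cost
    return build_cost / managed_fee_per_month


# Hypothetical numbers: Z = 1 month of work at 8000/month, X = 5/month (one coffee).
break_even = months_to_break_even(
    build_months=1, monthly_labor_cost=8000, managed_fee_per_month=5
)
# break_even == 1600.0 months, i.e. well over a century of coffee-priced fees.
```

With numbers anywhere near these, the "no running costs" rule only pays off on a timescale nobody plans for.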

> In this particular example, my "reset password" functionality could be sending me an email, and I have to reset the password manually and email you a temporary one that you can use.

That's exactly what I just did at the soon-to-launch startup I'm currently working at. There is a change password function, but no password reset (or other systems that would require generating stateful links to be emailed). As an added bonus, if you try to sign up with an email address that is already in use, the automated email you receive (from an address monitored by a human) includes something to the effect of "If you aren't sure of your password, reply to this email and we'll get it straightened out."

> That is why a very technically solid person is gold in the first round, because they can design the signup in such a way that it's prepared for future requirements. They of course would also know that you can't store passwords in plain text, that they should be hashed, that you might have to use a vault, that you need to randomize the hash, and that in the future you should limit access to the "user" table and never ever expose it through an endpoint where a user can reach it, as well as protecting against XSS and SQL injections.

Wrong. The correct technical decision is to bundle in a library that solves this for you - login with Google, or Facebook, or GitHub, or OIDC, pick one according to the context. Get password reset, MFA, password security, etc. for free.

Why reinvent the wheel, poorly?

The thing is, that's not just a technical decision. There are strategy, product, and operational reasons why you might decide to add or avoid SSO. SSO for sign-on can be really complicated and have lots of caveats -- for example, Apple requires any app with any SSO option to also support sign in with Apple, or be removed from the app store. Or you may want to restrict certain SSO to enterprise customers only, as a feature, but not have any of them yet. Or your users might have privacy concerns about SSO, or not have accounts at your SSO providers.

Any of these things may or may not be relevant to your decision, and I know that authentication is a specific example within a larger point. What I'm trying to say is that it's never correct to say "<x> is the right approach for all <y> trying to do <z>". Every decision has consequences, and what really matters is your ability to foresee those consequences and weigh them against the broader strategic picture of what you're trying to accomplish.

I think this was an example based on the OP's example.

Assuming that Google or Facebook or GitHub has implemented a library correctly, and that it will remain available as long as you want your app to function, is dangerous.

As opposed to assuming that you'll implement it correctly when Google or Facebook or GitHub haven't?

Yes, because what's good for them is not necessarily what's good for you.