Hacker News
by phoboslab 1704 days ago
> If you’ve created a database before, you probably had to estimate how many servers to use based on the expected traffic.

The answer is "one". If you have less than 10k req/s you shouldn't even start to think about multiple DB servers or migrating from bog-standard MySQL/MariaDB or Postgres.

I will never understand this obsession with "scaling". Modern web dev seriously over-complicates so many things, it's not even funny anymore.

14 comments

What happens when that database fails? Are you OK losing some data, or do you want the data to be synchronously replicated off the machine and be available somewhere else after failure? Distribution isn't only about scale, it's also about availability.

What happens when that database loses some data? Do you want an up-to-the second backup, or point-in-time recovery? Or are you OK restoring last night's backup? Distribution isn't only about scale, it's also about durability.

What happens when you need to run an expensive business process ad-hoc? Do you want it to be easy to scale out reads, or to export that data to an analytics system? Or are you OK building something else to handle that case? Distribution isn't only about scale, it's also about flexibility.

What happens when you want to serve customers in one market, and make sure that their data stays local for regulatory compliance reasons or latency? Are you OK with having separate databases? Distribution isn't only about scale, it's also about locality.

Fair points. I would argue that for most people a simple master-slave setup with manual failover will produce far fewer headaches than a "serverless" architecture.

When you are big enough to worry about the other issues, you surely are big enough to handle the requirements in-house. I see the dependence on some specific companies as the bigger threat to reliability.

The setup you describe is very much not simple. I worked at a place with very good DBAs and our replication setup caused us more downtime than anything else. Cockroach and Spanner exist because many programmers observed that what you describe is hard.
As a counter-anecdote: multiple startup projects I've worked on with separate MySQL setups where each had just a single master + two slaves (one warm for fast failover in case of hardware failure or upgrades, one cold for slow analytics-style queries) did just fine with millions (to tens of millions) of users. No downtime at all for years on end.

MySQL and Postgres are massively more widely used than Cockroach and Spanner, broadly very successfully. It's entirely feasible to run them with high uptime.

Very few deployments experience actual failures. Could be some fridge-door/light situation going on.
> fridge-door/light situation going on

What does it mean?

This is probably one of the best motivations for a distributed database that I've read.

I find that it's not often that people grasp that distribution is about availability. It's obvious when you say it, but for a long time my own intuition was that distribution is mostly about durability, or about consensus protocols that provide total order across multiple machines. Yet these build together into availability.

In fact, I first noticed this distinction when reading Brian M. Oki's seminal 1988 paper on Viewstamped Replication, the work that would pioneer the field of consensus—a year before Paxos but with an intuitive protocol essentially identical to Raft. The surprising thing is that today many of us might have titled the paper something about "consensus" or "total order" (which it practically invented, and which was the major breakthrough, at least how to do this in the presence of network partitions) but that he titled it "Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems".

I did a short intro talk to Viewstamped Replication (and particularly why FTP or nightly backups or manual failover are not a solution): https://www.youtube.com/watch?v=_Jlikdtm4OA

The talk is followed by interviews with Brian M. Oki and James Cowling (authors of the 1988 and 2012 papers respectively).

This is a good argument against running a k8s cluster when you don't need one, but not a good argument against this new serverless Cockroach product.

Serverless is not just about auto scaling up from 1 to n, it's about autoscaling down from 1 to 0.

If Cockroach provides a robust SQL DB at a marginal cost of ~$0/mo for small request volumes, that is a real value add over running your own pg server.

Not having to deal with administration or backups is another big value add.

This offering looks like it compares very nicely to say running an rds instance with auto backups enabled.

Tradeoffs are tradeoffs.

In your k8s example, by running a k8s cluster when you don't need one, a cost you pay is the overhead.

In the Cockroach serverless case, costs that come to mind include vendor lock-in when you evolve a pattern of production traffic that is hard to migrate to other solutions, and security and compliance challenges due to the virtualized instances running on shared clusters. In many cases these tradeoffs may be worthwhile. My point is that looking at it only in the dimension of scaling up and down doesn't tell the whole story. The OP doesn't talk about tradeoffs, so in the comment section we must.

In the case you mention, you've made a tradeoff to defer developing an in-house solution for supporting a potential future pattern of production traffic, and that can be a huge accelerator.

Further, once you're experiencing the type of success that demands a superior solution, the priority to invest in such a solution is clear.

It isn't "lock-in" - there's enough experience and capability in the market to solve those problems _once you have to_. Solutions like this let you decide whether the right time is at start-up or scale-up.

k8s lets you put multiple deployments on virtualized hardware in a cloud agnostic/consistent way. You can do all that on one machine. Even if you're running a single instance deployment you still probably want at least two environments. K8s isn't without value in that respect.
It seems like you’re defining “scaling” as growth of a workload to the point that it cannot be handled by a single-server DB.

But with any service without a constant workload (I’d wager almost all services besides prototypes that get no users) you’re going to have to literally scale that one machine, by replacing it with a bigger machine. When you have 50 users you’re not going to be paying for some yy.24xlarge. You’ll start with something much more affordable. When the service grows to 50,000 users, you certainly won’t be at “Facebook scale”, but that t3.small isn’t going to cut it. Should your service ever decline, it’d be nice to scale that machine down to save on costs.

At a previous job, we spent many human hours continually ratcheting up the size of our Postgres machine a few times a year. Not only did this take non-trivial engineering hours and mind-space, it also caused maintenance downtime due to the limitations of traditional DBMSs.

Self-managed CockroachDB eliminates the downtime needed to scale. To handle a more intense workload, add machines. If you want to vertically scale each machine, that can be done without downtime too.

CockroachDB Serverless takes this a step further by scaling up and down to suit the demands of a highly dynamic workload, while minimizing costs.

Maybe what looks like a mega-scale obsession to you is actually a bunch of people trying to avoid the common headaches of managing a moderately sized, dynamic service.

During a technical interview many years ago, I commented that the system architecture didn't scale.

The interviewer responded "When that matters, I won't even be managing the person whose problem that is."

Isn't that the right mindset?
This will be somebody else's problem? Ehm, no it's not.

- When the bridge collapses I'll be managing someone else.

- When they find out this airplane has critical design flaws I'll be managing someone else.

- When this software I'm working on is hacked I'll be managing someone else.

It's not a matter of whose problem it is. It's just that a scalable architecture is, in most cases, a premature optimization. When building a product, scalability is only one aspect, and for most startups and companies it is among the smaller ones.

I personally interview a lot of people and if they start proposing microservices or k8s or anything trendy like that (before having context), I consider it a negative point.

When hiring, I want someone to take a look at a business problem and break it down into smaller pieces. Most of the time, the most important engineering work is coming up with the right data models/data structures.

So yeah, maybe the architecture I have now won't scale up. But at least it'll get the business going. Years later when there are more resources, the software could be rewritten or whatever.

Also, your example mentioned "software being hacked". I never said to undermine security. Security should always be taken seriously. Security is not a premature optimization. Scalability is.

> Security is not a premature optimization

There are plenty of situations where decent security is a premature optimization too. For example prototypes or proofs-of-concept that are only intended to demonstrate or benchmark a capability. The sort of thing where, even if you did insist on making it 'secure' the username/password would be admin/admin.

I wasn't commenting on whether scalability is important or not. I was commenting on that mindset: "When that matters, I won't even be managing the person whose problem that is."

What you said all sounds reasonable. Sure, don't build something scalable if you've figured out you want something quick to prove your market and you're ok with tossing it all away or if you'll never need the scale. These are not the same thing. But this will be somebody else's problem? I don't want that guy on my team... Something someone told me a long time ago, write code like the guy that takes it over from you is a psychopath and has your address.

All bridges have a capacity limit.

"Doesn't scale" doesn't imply "inadequate."

Overbuilding is waste.

Software isn't bridges though. If you run into scaling issues and you're lucky, it's just a case of swapping out a database for a bigger one. More likely it isn't just that, though, and being able to do things like scale up distributed workers coherently with a change of a config file requires upfront thought and design that YAGNI would say isn't necessary. People just breezily saying "we can optimise it later" for a scale-up of orders of magnitude are nearly always wrong. I'd argue Reddit is a perfect example of a site that's clearly running into scaling issues but is locked into an architecture with few escape hatches built in.

When a bridge fails it'll just fail. When software runs into a scaling limit it'll degrade and fall on its arse constantly and be absolutely terrible as long as it takes for the software team to completely rework their entire architecture often having to learn completely new technologies.

No, software isn't bridges. When bridges fail, people die. When reddit crashes, people take a piss and pet the cat.

I don't know the relevant details about reddit, but I doubt the assumption that the early reddit people could easily have built something more scalable, yet there are tech reasons why the later people, with far more resources, can't.

As to the assumption that one knows the important bottlenecks 2-3 orders of magnitude in advance, that's just wrong.

"late answers are wrong answers" isn't just for real-time. That applies to products as well as signals.

Technical debt is not necessarily a bad thing.

Your examples aren't equivalent. You're comparing the failure of a prototype or proof-of-concept under stress to the failure of production models under expected conditions.

The real risk you're incurring by deploying a prototype to production until traction is demonstrated and scaling becomes an issue is that deferring the design of the 'real' system risks succumbing to second-system syndrome and prematurely trying to make it a 'platform'.

Roughly, the first version just has to work, the second has to scale, and the third (assuming you get there that soon) is the one that starts needing to be refactored/generalized into various reusable subsystems.

The second doesn't (necessarily) have to scale.

It has to handle the projected load, with some safety margin.

"projected until when?" you ask?

Well, your projections will be wrong. And, your needs will change, which means that you're going to be changing the system anyway.

That's where judgment comes in.

Suppose that each customer is worth $10/month and that the "not scalable" version can handle 1M customers. That's at least $5M MRR when the system is at 50%.

If the scalable version takes significantly longer to develop, you might choose to put it off until you have more money/resources, especially since there will probably be other changes. (And, you don't know where the actual bottlenecks are.)
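The revenue arithmetic in this comment can be checked with a few lines. This is only a sketch: the capacity of 1M customers and the $10/month figure come from the comment, while the function names and the extra utilization levels are illustrative assumptions.

```python
# Figures taken from the comment above; everything else is illustrative.
CAPACITY_CUSTOMERS = 1_000_000   # what the "not scalable" version can handle
REVENUE_PER_CUSTOMER = 10.0      # dollars per customer per month


def monthly_recurring_revenue(customers: int, revenue_per_customer: float) -> float:
    """MRR in dollars for a given number of paying customers."""
    return customers * revenue_per_customer


def mrr_at_utilization(utilization: float) -> float:
    """MRR when the system is filled to the given fraction of its capacity."""
    customers = int(CAPACITY_CUSTOMERS * utilization)
    return monthly_recurring_revenue(customers, REVENUE_PER_CUSTOMER)


if __name__ == "__main__":
    for utilization in (0.25, 0.5, 1.0):
        mrr = mrr_at_utilization(utilization)
        print(f"{utilization:.0%} of capacity -> ${mrr:,.0f} MRR")
```

At half capacity this gives the $5M MRR quoted above, which is the point of the comment: the rewrite can be funded well before the limit is reached.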

(robot voice) Does /dev/null support sharding? Sharding is the secret ingredient in the web scale sauce.
it's impressive how well that video has aged with modern web dev.
The answer is unfortunately less clear cut. Particularly if you assume that whoever is tasked with scaling this hypothetical DB doesn't know what they are doing a priori.

The following questions are likely to come up

1) My t3.xl DB is down, how much bigger can I make it?

2) My r3.24xl DB can only handle 100 TPS and now my site is down, what can I do?

3) My 2x r3.24xl DB cluster costs a lot of money, are other solutions cheaper?

4) My latency is high, are other solutions faster?

For someone who hasn't dealt with these questions before, these will become long and painful lessons with massive material impacts to the business.

It's appealing to use Dynamo as it takes the magic out of scaling. It's appealing to use serverless RDBMS as you don't have to think about it anymore unless it has high costs/latency.

> The answer is unfortunately less clear cut. Particularly if you assume that whoever is tasked with scaling this hypothetical DB doesn't know what they are doing a priori.

The answer is very clear-cut:

Work with professionals.

The number of professionals who know and care about scaling who also want to work on small scale applications is relatively small. Hiring them will pose a challenge.
The amount of data being collected in the world is growing much faster than the number of database engineers is.
Why would you assume that the person responsible for a thing doesn't know what they're doing?
Unfortunately, it's easy to hit that with, say, GraphQL, where each client request can resolve to dozens of DB selects versus a single hand-written/tuned SQL select.
Maybe that's a good reason to avoid GraphQL then?
That's not the reason. There are many, but this is a nice feature: a client can send one request and it will resolve internally (so fewer client-side requests), you can request only the fields you need, and it's easier when hitting multiple microservices.

Now you get into the issues of tracing, request failures and retries, payload size (only json? really..) etc.

But I do agree that super-optimized SQL for a specific purpose has its performance benefits.

In theory the graphql server could be smart enough to assemble a single SQL select to satisfy the query. Does anyone know of an implementation that does this?
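To make the difference discussed here concrete, below is a minimal sketch using Python's built-in sqlite3 module. It does not use any real GraphQL library: the `authors`/`posts` schema and both function names are hypothetical, and the "naive" function only imitates what a resolver-per-field server tends to do (one select for the parent list, then one per parent row), compared with a single hand-written join.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with a hypothetical authors/posts schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL REFERENCES authors(id),
            title TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO authors (id, name) VALUES (?, ?)",
        [(1, "ann"), (2, "bob"), (3, "cy")],
    )
    conn.executemany(
        "INSERT INTO posts (author_id, title) VALUES (?, ?)",
        [(1, "a1"), (1, "a2"), (2, "b1")],
    )
    return conn


def posts_per_author_naive(conn: sqlite3.Connection):
    """Resolver-per-field style: one select for authors, then one select per author."""
    queries = 0
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def posts_per_author_joined(conn: sqlite3.Connection):
    """Hand-written style: the same data from a single LEFT JOIN."""
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors AS a
        LEFT JOIN posts AS p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts yield a NULL title
            result[name].append(title)
    return result, 1


if __name__ == "__main__":
    conn = make_db()
    print(posts_per_author_naive(conn))
    print(posts_per_author_joined(conn))
```

Both functions return the same mapping. The naive one issues one query plus one per author, so the count grows with the number of parent rows, while the joined one stays at a single query.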
If you're using GraphQL with SQL -- postgres specifically -- I would say use Hasura, but support between Hasura & CockroachDB seems to have stalled due to missing triggers [0] [1]. CRDB supports a feature called "changefeeds" [2] which, it is claimed, might cover some of Hasura's use-cases, but that's a proprietary extension not present in base PostgreSQL.

[0]: https://github.com/hasura/graphql-engine/issues/678

[1]: https://github.com/cockroachdb/cockroach/issues/28296

[2]: https://www.cockroachlabs.com/docs/v21.1/stream-data-out-of-...

Any views on which to choose between Prisma, Hasura or postgraphile for GraphQL and postgres?
I feel like you forgot SQLite, and for really small scale you could just use whatever storage/serialization solution comes built in to your favorite language.
And the rationale that validates this was written back around 2006[0]

[0] https://web.archive.org/web/20090306191715/http://www.my-idc...

Side note: this guy is Dominic Szablewski, author of the amazing ImpactJS, JSMpeg, QuakeVR. Probably a 10x developer. He rivals Fabrice Bellard of ffmpeg fame. Check out his site: https://phoboslab.org/
The question isn’t whether you have 10k req/s now, but whether you expect to in the future. If you are designing a blog, then yeah, you probably don’t need to worry about it. If you are starting a social network or SaaS business application, then you probably do.
The future will be different.

A lot of successful businesses start with things that are not scalable, and it is a strength, not a weakness. If you start a social network for instance, you can't beat Facebook at its own game. You have to do something that Facebook can't do because it is too big. Scalability problems will be tackled as you grow.

Among the many things Facebook can't do is running its service on a single database. That makes things much harder for them. Thankfully, you are much smaller than Facebook and you can. Take that advantage.

Alternatively, people really do need to put a lot of thought into scaling, but only because they did something like write some core web service in an interpreted language framework that maxes out at 100 requests per second.
But wouldn't you agree that the initial architecture is really important? It should at least be designed with scaling in mind, since it can get difficult to change things afterwards.
Sure if "in mind" includes realistic capacity planning.
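As a rough illustration of what such capacity planning could look like, here is a back-of-the-envelope sketch. The 10k req/s figure is the rule of thumb from the top comment, not a measured limit. The peak-to-average ratio, the safety margin and the example traffic numbers are all assumptions to be replaced with your own measurements.

```python
SECONDS_PER_DAY = 86_400


def peak_requests_per_second(
    daily_active_users: int,
    requests_per_user_per_day: float,
    peak_to_average: float = 3.0,  # assumed burstiness, not a measured value
) -> float:
    """Estimate peak load from daily totals and an assumed peak/average ratio."""
    average_rps = daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY
    return average_rps * peak_to_average


def fits_single_server(
    peak_rps: float,
    server_limit_rps: float = 10_000,  # rule of thumb from the top comment
    safety_margin: float = 0.5,        # assumed: plan to use at most half the limit
) -> bool:
    """True if the projected peak stays within the limit minus the safety margin."""
    return peak_rps <= server_limit_rps * safety_margin


if __name__ == "__main__":
    # Hypothetical service: 50,000 daily users making 200 requests each.
    peak = peak_requests_per_second(50_000, 200)
    print(f"projected peak: {peak:.0f} req/s, single server ok: {fits_single_server(peak)}")
```

With these assumed numbers the projected peak is a few hundred req/s, far below the threshold, which is the kind of result that argues for keeping the single database.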
I disagree. Even the simplest master-slave setup is bound to cause an outage once or twice.

Now, if you set up your DB using a public cloud's flavor of PostgreSQL... That's a different story.