by cphoover 2881 days ago
NoSQL has been and continues to be hugely influential. All major cloud players provide document/object based storage, as well as other NoSQL Solutions. The term "NoSQL" was dumb and overhyped... But I think it's really about using the correct storage solution for the job.

Non-relational data should be stored in a non-RDBMS. Key-value stores like Redis are immensely useful as caching layers (and they offer many more features besides). Graph databases can be used for data with complex relationships that are not easily modeled relationally. They are also good for finding strong correlations between related items (think "person A called person B, who called person C", Palantir-type searches). Searches can be done far more effectively in a specialized index, like the inverted index used by Lucene/Elasticsearch, which also supports things like stemming, synonyms, and numerous other features. These are all "NoSQL". NoSQL is not just MongoDB (which isn't nearly as bad as people make it out to be, btw).
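To make the inverted-index point concrete, here is a minimal sketch in Python. It is not how Lucene or Elasticsearch are implemented; the `stem` function is a toy suffix stripper assumed purely for illustration, and the documents are made up.

```python
from collections import defaultdict


def stem(word: str) -> str:
    # Toy stemmer (an assumption for illustration): strips "ing" or a
    # trailing "s". Real engines use proper stemming algorithms.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> list[str]:
    # Lowercase, split on whitespace, and stem each word.
    return [stem(w) for w in text.lower().split()]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    # Map each stemmed term to the set of document ids containing it.
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    # AND semantics: a document must contain every query term.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, for illustration only.
docs = {
    1: "scaling relational databases",
    2: "graph database searches",
    3: "caching with redis",
}
index = build_index(docs)
```

A lookup is then a set intersection over the query terms rather than a scan of every document, and because of the stemming step `search(index, "database")` matches both the singular and the plural form.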

Even traditional RDBMSs are seeing an influx of NoSQL-esque features, like JSON types and operations in Postgres.

The reason "NoSQL" DBs got popular is that, in my experience, large monolithic relational databases are hard to scale and to manage once they become too complex. When you have one large database with tons of interdependencies, it makes migrating data and making schema changes much harder. This, in my opinion, is the biggest issue (more so than the performance problems associated with doing joins to the n-th degree, which are also an issue).

It also makes separating concerns of the application more difficult when one SQL connection can query/join all entities. In theory better application design would have separate upstream data services fetch the resources they are responsible for. That data can be stored in a RDBMS or NOSQL, but NOSQL forces your hand in that direction.

As for serverless, it just seems like a natural progression from containerization. I'm interested to see where the space goes.

Personally I think it's foolish to put your head in the sand when the industry is changing, or learning new concepts.


> The reason "NoSQL" dbs got popular are because in my experience Monolithic large relational databases are hard to scale.

I've met a lot of people whomst thought they had to scale that big. Very few handled anything that couldn't run off a beefy postgres installation.

The purpose of a system is what it does. People don't use nosql to scale because they don't need to scale, so what does it do? People use nosql to not write schemas. That's what it's for, for the majority of users.

If I need a key value store, I use a key value store. There's no flashy paradigm there. If I need to put a container up on the interwebs, I do it. What's serverless? Nosql is an "idea", "paradigm", "revolution", or at least the branding of one. Just the same, serverless.

I will continue to ignore nosql and serverless.

The industry sure does change, but do you know how much of that is moving in a real direction and how much is a merry-go-round? Let's brand it "Carousel" and raise 10 million. And in 20 years we can talk about serverless being the new hotness, again.

> Very few handled anything that couldn't run off a beefy postgres installation.

My impression, from attempting to evangelize scaling "up" before scaling "out" (because it's both cheaper and much lower effort/labor/time) is that vanishingly few programmers have any idea what a "beefy" installation would even look like.

I routinely encounter implicit assumptions (partially driven, these days anyway, by what VPS and cloud providers offer) that the "largest" servers are 2U (or 4U, if I'm lucky) and are I/O limited by the number of disks they can hold in their chassis.

Similarly, there seems to be a lack of awareness of just how big main memory can be on a single server, even before paying a price premium for higher-density modules.

Not knowing where the price-performance curve inflection points (for memory and/or CPU) happen to be also seems to be associated with not knowing where the price tops out. It's as if they fear the biggest server they can (and will be forced to) buy will cost a million bucks, rather than $100k.

Scale is not just user load, but also scale of application complexity. In my experience, when one DB connection has access to every resource in a complex application, this can lead to some really convoluted queries and make schema changes very difficult because of the cross-cutting dependencies built into those queries, triggers, procedures, etc. And that's before you get to deadlocks, when you have 80 consuming services, and applications you don't even know about, opening up all sorts of transactions. Even just splitting the DB into schemas for each resource domain and limiting access per service can help to avoid this.

Also, performance is relative. I've worked on highly trafficked applications that had to support high throughput. I have also worked on applications backed by relational storage where data size and complexity have impacted performance.

> "Scale is not just user load, but also scale of application complexity"

In my experience, when people use NoSQL because "the application is too complex for relational DBs" they tend to make a mess of it, NoSQL included. They usually end up reinventing the wheel and re-writing buggy versions of features a RDBMS would have given them natively.

Been there, done that, migrated everything back to Postgres and saw huge gains.
I don't think I've seen a deadlock in a long long time on most major DB platforms.

PG also lets you get very vague about it being a relational DB if you want.

And tbh, if the size of your table impacts performance, you either don't have a very good DBA or your DBA doesn't know what partitioning is, both good reasons to replace them.

Most modern DBs don't have any of these issues. PG can cleanly handle live schema changes since it wraps them in transactions. Old transactions simply use the previous schema. MariaDB requires a bit more fiddling, but GitHub figured it out.
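As a rough illustration of what "schema changes inside a transaction" means, here is a small sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in so the example runs anywhere; this is not Postgres, and it does not show concurrent transactions seeing the previous schema. The table and column names are hypothetical.

```python
import sqlite3


def column_names(conn: sqlite3.Connection, table: str) -> list[str]:
    # PRAGMA table_info returns one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


# isolation_level=None puts the driver in autocommit mode, so the explicit
# BEGIN/ROLLBACK below are the only transaction control in play.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

conn.execute("BEGIN")
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
inside = column_names(conn, "users")   # the new column is visible here
conn.execute("ROLLBACK")
after = column_names(conn, "users")    # the schema change is undone

print(inside)
print(after)
```

The point of the sketch is only that the DDL statement participates in the transaction: if the migration fails halfway, rolling back leaves the schema as it was.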

And from experience, you're likely not going to hit the scale where you need multiple DB nodes for performance. In 10 out of 10 cases, a simple failover is what you need (but didn't invest in because MongoDB is cooler).

> when one db connection has access to every resource

So why not use db users to restrict each part to only be able to access the parts it should?

Sure that works... I think encapsulation through separate db schemas is generally sufficient. Most people don't start or end up here however. I'm not saying that RDBMS used correctly is a bad thing. I prefer multiple small postgres schemas per "data service" (what I'm calling a service that deals only with data persistence, and updating consumers about changes to data), each schema can correlate to a single resource, or smallest possible domain of the application. These services can publish notifications about updates that can be consumed by consuming downstream services.

It's my opinion that microservices should do one thing and do it well, and the data storage that backs these services should only be concerned with the domain of that single-purpose service. It should be isolated from all other concerns.

For example, having one schema for "users" and a separate one for "messages".

Where to draw those dividing lines is not always easy.

Very much this. Sooooo many times I hear the cry of "does it scale?" To which I reply, "Does it need to?!"

At my last company we had a developer question scalability constantly despite the fact that the average customer of an instance of our product had about 200 users.

I like to add, "does it need to beyond what's delivered by Moore's Law?" (which I use as a metaphor for all increases in computing performance, including I/O, which has, of course, increased at a much slower, but far from zero, pace).

If your CPU utilization from user growth is doubling every 2 years, but so is CPU capacity, then don't worry about it.
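A back-of-the-envelope sketch of that argument in Python. The function name, the starting utilization, and the doubling periods are all made-up inputs for illustration, not measurements.

```python
def utilization_over_time(
    initial_utilization: float,
    load_doubling_years: float,
    capacity_doubling_years: float,
    years: int,
) -> list[float]:
    # Utilization in year t = initial * (load growth / capacity growth),
    # with both load and capacity modeled as simple exponentials.
    result = []
    for year in range(years + 1):
        load = 2 ** (year / load_doubling_years)
        capacity = 2 ** (year / capacity_doubling_years)
        result.append(initial_utilization * (load / capacity))
    return result


# Load and capacity both double every 2 years: utilization stays flat.
flat = utilization_over_time(0.4, 2, 2, 6)

# Load doubles every year but capacity only every 2 years: utilization
# climbs, and after 4 years it is 4x the starting value.
rising = utilization_over_time(0.2, 1, 2, 4)
```

Under these assumptions you only need to worry when the load curve is steeper than the capacity curve, which is the second case.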

> Very few handled anything that couldn't run off a beefy postgres installation.

Beefy postgres would get you to 99.9% availability at best, with pretty bad tail latency and would cost quite a bit to operate. As it turns out, very few can actually live with that. And even infamous MongoDB can do better at this than PostgreSQL. Ignorance simply makes your business less competitive.

> Beefy postgres would get you to 99.9% availability at best

This is just false. Shrug.
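For reference, this is the arithmetic behind a figure like 99.9%, sketched in Python. It only converts an availability percentage into a yearly downtime budget; it says nothing about what any particular database actually achieves. The function name is hypothetical.

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    # A non-leap year has 365 * 24 * 60 = 525,600 minutes.
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - availability_percent / 100)


three_nines = downtime_minutes_per_year(99.9)    # about 525.6 minutes
four_nines = downtime_minutes_per_year(99.99)    # about 52.56 minutes
```

So 99.9% allows roughly 8.76 hours of downtime a year, and 99.99% allows a little under an hour.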

> monolithic large relational databases are hard to scale

DB2 on z/OS was able to handle billions of queries per day.

In 1999.

Some greybeards took great delight in telling me this sometime around 2010 when I was visiting a development lab.

> When you have one large database with tons of interdependencies, it makes migrating data, and making schema changes much harder.

Another way to say this is that when you have a tool ferociously and consistently protecting the integrity of all your data against a very wide range of mistakes, you have to sometimes do boring things like fix your mistakes before proceeding.

> In theory better application design would have separate upstream data services fetch the resources they are responsible for.

A join in the application is still a join. Except it is slower, harder to write, more likely to be wrong and mathematically guaranteed to run into transaction anomalies.
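To illustrate, here is roughly what an application-side join looks like, sketched in Python. The two input lists stand in for responses from two hypothetical data services ("users" and "messages"); the field names are assumptions.

```python
def join_messages_with_users(users: list[dict], messages: list[dict]) -> list[dict]:
    # Build a hash table on the join key, then probe it once per message.
    # This is a hand-written hash join.
    users_by_id = {u["id"]: u for u in users}
    joined = []
    for m in messages:
        user = users_by_id.get(m["user_id"])
        if user is None:
            # Inner-join semantics. Because the two lists were fetched
            # separately, a user deleted between the fetches lands here.
            continue
        joined.append({"message": m["body"], "author": user["name"]})
    return joined


# Hypothetical payloads from two separate services.
users = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
messages = [
    {"user_id": 1, "body": "hello"},
    {"user_id": 3, "body": "orphaned"},
    {"user_id": 2, "body": "hi"},
]
rows = join_messages_with_users(users, messages)
```

The two fetches are not part of one snapshot, so nothing guarantees that `users` and `messages` describe the same moment in time, which is where the anomalies come from.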

I think non-relational datastores have their place. Really. There are certain kinds of traffic patterns in which it makes sense to accept the tradeoffs.

But they are few. We ought to demand substantial, demonstrable business value, far outweighing the risks, before being prepared to surrender the kinds of guarantees that a RDBMS is able to provide.

Not everything requires pessimistic transactional guarantees or atomicity. The problem domain you are solving for will influence the importance of those guarantees. If I'm solving for something where data consistency is not an utmost priority (tons of applications meet this criterion, including the one you are using now, HN), I don't have to worry about this.

But when you have transactional guarantees you also lose partition/failure tolerance. So it ends up being a choice of consistency over availability.

> Not everything requires pessimistic transactional guarantees or atomicity.

They are easier to give up after the fact than to try to regain after the fact.

> If I'm solving for something where data consistency is not an utmost priority (tons of applications meet this criteria, including the one you are using now HN.) I don't have to worry about this.

Sure. But wait for the pain. Prove the business need to relax the guarantees and the business acceptance of the risks.

> So it ends up being a choice of consistency over availability.

Total partitions are relatively rare and so disruptive that even if the magical datastore keeps chugging, everything else is mostly boned, so it doesn't matter. Meanwhile people tend to discover that actually, consistency mattered all along, but it's impossible to fix in retrospect.

Then there's the whole thing of bold claims being made in theory and not delivered in reality. RDBMSes, with the exception of MySQL which is close to being singlehandedly responsible for the emergence of NoSQL in the first place, tend to actually deliver on what they promise. The record for the alternatives is mixed, the fine print varies wildly and tends to leave out important details like "etcd split brains if you sneeze too loudly" or "mongodb is super fast, unless you want your data back".