Hacker News new | ask | show | jobs
NoSQL: The Baby and the Bathwater (brooker.co.za)
53 points by iforgetlogins01 1228 days ago
7 comments

I don't like NoSQL databases. "NoSQL" should be manna from heaven. It may be impossible to create a less pleasant language than SQL. It isn't composable, it isn't internally consistent, it isn't easy to parse, it claims to be declarative but the ordering of the clauses is completely rigid, it fights every attempt at writing testable or maintainable code. It is hard to read. It is hard to programatically generate. Everyone seems to be trying to develop systems that mean they don't have to write SQL code.

Add all that up and NoSQL should be fine. But then the actual NoSQL databases let people throw out all the concepts that make SQL so sticky. Schemaless databases are in the same class as goto statements - there are people I trust to use them to do amazing things. The projects I inherit to maintain are never written by those people. Transactions and consistency guarantees are necessary to get a group of programmers together writing a reliable application. The relational model is the best idea to come out of database research.

I wish NoSQL was a postgres extension. Instead we get MongoDB advocates. Bless them, but PostgreNoSQL would be so much better for all the use cases I have than MongoBD.

> It may be impossible to create a less pleasant language than SQL.

It’s not by the way, there’s an entire segment of the esolang space for which it’s the goal (starting with the canonical example that is INTERCAL).

But ignoring esolangs, there are lots of very unpleasant programming languages out there. M/MUMPS is a well known one (especially with “legacy” coding styles, or so I gather). I also consider XSLT to be abhorrent, especially given the nice underlying conceptual idea (not entirely dissimilar to SQL really, just worse).

About SQL, an idea I saw surface recently in a related discussion was how nice it’d be if databases could expose the data model interaction and allow building on that directly: most every SQL database compiles the query into some sort of bytecode (combining direct translation and planner information) to actually run it on its storage layer, some databases allow peeking at the bytecode (sqlite actually prints the bytecode as part of its EXPLAIN output) but I don’t know that any allows bytecode input.

TBF the bytecode is very much considered an internal detail, it can change a lot between versions and (most importantly) tends to be more or less completely unchecked, it relies on the compiler generating it being correct (not unlike cpython for instance).

MUMPS always reminds me of Perl code golfing.
I think the right way to think about NoSQL databases is you have an application where traffic is expected to be heavy enough that it's worth throwing out what SQL gives you for free and having a highly optimized solution where you deal with those problems yourself. Of course, it was a big enough trend that many people jumped on it without having a practical use for it.

Arguably psql does offer a NoSQL-like experience with JSON columns.

> Arguably psql does offer a NoSQL-like experience with JSON columns.

I hope not. JSON columns purge all the good things a relational database offers and keeps the SQL. It is the worst of every option.

That really should be the opposite of the NoSQL experience. Disbarred by the fact that SQL is involved.

JSON columns are very useful in some circumstances but you shouldn’t use them as a replacement for a database schema, and if I’m honest I’m yet to see anyone truly suggest that.
I use schema for everything relational, and put everything else (all the nice-to-know data) in JSONB. Works really well. Stacks like Rails lean into this with `store_accessor` which lets you designate a JSON column as storage for arbitrary data that looks and feels like a separate column.
I guess it depends what part of NoSQL you’re looking for, but if you want a bit of schemaless data it makes sense.
How about EdgeDB? Never used it, but I found it when looking for a similar thing to you after reading this article. It's built on top of PostreSQL, uses it as a storage engine and provides a NoSQL interface to it. It calls itself a "graph-relational database."
I agree completely.

When I first heard about NoSQL I thought some databases were finally introducing some replacement for abominable SQL. But no, they were just creating schemaless databases.

I'm so glad this sentiment is being spread more. So tired of explaining that when I complain about SQL I *emphatically do not mean relational databases in general*, only to have everyone counter with "but relational stuff is good!"
Your argument seems a bit childish to me. Like you never really cared to learn the tool.

SQL is relational algebra. Sure you can write it some other way. Lots of projects tried, none succeeded. Wonder why is that?

End of the day - if you just learn the syntax, the hard part will be in the logic. As it should be.

I can't tell if you are being sarcastic or not.

In case you aren't: I didn't state my argument at all. Unless you dug through my comment history to find my complaints. This is precisely the kind of nonsense you get in response to even stating that you dislike the language.

> SQL is relational algebra. Sure you can write it some other way.

The same applies to general purpose programming languages, yet they have improved immensely since COBOL.

> Lots of projects tried, none succeeded. Wonder why is that?

Because of inertia, the mixed userbase of SQL, and ORMs and things like LINQ. The fact that most projects avoid direct use of SQL when possible is telling enough on its own.

NoSQL really means NoRDMS. The SQL language, or lack there of, has little to do with it.

The unfortunate naming leads to this kind of confusion. Probably a technology should never be named for what it is not.

It originally was the name for a non-SQL RDBMS: https://en.wikipedia.org/wiki/Strozzi_NoSQL

It malappropriated in 2009.

This explains a lot about the GNU project.
The note about composability annoys me. Does it have composability akin to functional programming languages? No. But SQL isn't a functional programming language. Never has been. It's the world's most popular 4th generation language. (Most popular high level languages are 3rd gen in case a reader was wondering.)

With the lens that SQL is a DSL for set theory and set transformation (which it is), composability takes the form of views, temp tables, and CTEs. Not what you're looking for? That's fine. But they 100% make SQL composable.

I take issue with saying it's inconsistent as well. Are there warts? Sure! Like any language. But fundamentally inconsistent from the lens of DSL for set theory and transformation? No.

Folks keep trying to make relational system access like their favorite programming language, and it's not going to happen. Are folks' favorite general purpose programming languages 4th gen languages with an emphasis on set theory and transformation? No? Then stop trying to shoe horn it on! It only leads to frustration. Embrace the set theory and 90% of SQL's issues don't register as issues anymore.

As for NoSQL like MongoDB as a Postgres extension, that's what jsonb and its related operators and functions are for. Seriously. to_jsonb(…) and jsonb_populate_recordset(…) are seriously under-appreciated. They're like rocket fuel for JSON processing. Of course you're also welcome to keep a jsonb column lying around to keep things "schemaless", but experience has shown that defining your data (with types) rather than allowing free-form blobs in your database is better in the long term. You obviously see this already by your trust statement with regard to coworkers. "Better" meaning more performant and easier to maintain.

Just like thinking "functionally" takes training and practice to grok when you've been object-oriented all your life, that "set theory" takes training and practice as well. And it's so worth it. Feels like a goddamn superpower sometimes.

You can always use JSONB columns to get the Postgres experience.
On the one hand, FROM first makes conceptual sense. On the other, it doesn't offer anything significantly better than SQL. Is it marginally more terse? Yes. Is terseness paramount in a database manipulation and transformation language? I'm not convinced.
> Bless them, but PostgreNoSQL would be so much better for all the use cases I have than MongoBD

And for some of us MongoDB is a better option than PostgreSQL.

Many of us simply can't rely on scalability and high availability being something that isn't part of the core product.

The built in horizontal scaling of Mongo is nice, but I would still put PG's reliability per-instance higher than Mongo's.

The reason being that Mongo is written in C++ with lots of RAII and will simply die on a memory allocation failure. Postgres won't. It'll keep running in many low memory scenarios and the operation will error out.

In what way is scalability and high availability not part of Postgres' core?
Master-master replication probably, think Galera.
You would think after all these years people would stop using the term NoSQL. They are just databases.

Many of the so-called NoSQL ones like MongoDB or Cassandra can support schemas, transactions, strong consistency, joins, secondary indexes etc. And SQL databases like PostgreSQL support schema-less data structures.

And with Presto, Spark SQL etc you can use SQL with almost any data store.

Just because they support those things does it mean it’s straight forward to use them

Our team had a lot of trouble trying to map highly relational data to a noSQL database (mongo)

It could have been a failing of our team, but I also think it just made our lives way harder. We’re on Postgres now and a lot of issues have faded away.

Same experience here.

We had a small-ish application that was originally built in top of MongoDB. Once it made it into production and started to see some success, it became quickly apparent that the schemaless design caused problems that an RDBMS would have solved.

It was decided fairly quickly to remove the MongoDB underpinnings and migrate everything to Postgres. It was the right call, and the final nail in any remaining affection — and interest — I had for NoSQL stores.

Structure is important, that's where NoSQL fails from the get go. We should stop investing so much time into NoSQL and look more into combining SQL and Graphs
You may be interested in https://www.gqlstandards.org/ and https://www.iso.org/standard/79473.html

TL;DR the ISO standards committees behind SQL are working on bring graph query languages and SQL databases closer together.

Obviously there is work to be done at the storage/query planning layer after that, but I’m hopeful once the surface exists more widely that will drive more work in those areas.

You can use NoSQL with a schema and have strict enforcement. What the NoSQL db does is make storing objects easy with less boilerplate or abstraction.
How is NoSQL with schema validation and strict enforcement anything but SQL with extra steps?
The schema restrictions you can express in SQL are both insufficient and not flexible enough even for relatively trivial things, except you are either fine with storing and retrieving things in a completely different shape than you actually need, or if you are fine with constantly migrating your schema.

That's not a sufficient argument against adopting SQL for something, but other solutions are certainly _not_ just SQL with extra steps. Not everything maps nicely onto flat tables with predefined slots.

That seems a little like saying there's no reason for anyone to use a language without GC. Sure, for most applications it is fine.
That's a good analogy. SQL RDBMS's, just like Java and other languages with GC have huge utility in a multitude of enterprise software situations. These probably make up the majority of software systems and lines of code out there.

Then there are a few domains that NEED other solutions. In terms of CPU cycles these may be consuming more resources. These are operating systems, game engines, ML libraries, search engines, social media platforms and huge webshops like Amazon.

But to store all data in Spark or MongoDB just because that's what <insert Cargo Cult idol> is doing makes about as much sense at programming everything in C/C++.

I'm surprised to see so many opinions on what is "objectively" the best database paradigm. My incredibly boring take: databases are such a generic tool that it really depends on your use case, there are times when noSQL is stand out the best choice, and lots more when it isn't- the real pain point comes from people blindly cargo culting into a db paradigm without considering their use case.
Yup. I've got a (legacy) postgresql database with time series: one timestamp plus measurement per row. And there's a useless third column too. It's awful, and sluggish.

For a newer project, where we basically use one object, I've chosen a NoSQL database. Get the object, edit it in the browser, put it back. Done. No need to update relations, ORMs or any of that. But: that won't fly for more complex projects.

So I agree: pick the right tool for the job.

Hi tgv, curious to hear why you picked NoSQL for your time-series use case?
It's not a choice that's guaranteed to suit your use case, but storing a time series as a large set of rows is not performant. Perhaps I should have mentioned that our time series data is "write once": we record and store it, but it doesn't get altered. And the time series are all sent to the browser, which does the displaying, filtering and analyzing, so there's no point in processing (our particular) time series in SQL, if that were even feasible. So, basically, because it is good enough, and better than the available alternatives.
Sure SQL DBs can have all the advantages of NoSQL DBs and then some... but they will never have a lower price tag. And that's because NoSQL DBs use very, very little CPU. All that's needed is giving up reliance on SQL. Design your NoSQL "schema" with this in mind, and you're golden
Very little CPU depending on sorting, indexes, etc...

Have a lot of indexes on a collection? Mongo will eat 5-10ms of CPU time per query even with cached query plan stats just to start executing the query. So no different than PG here.

What you're getting at is Joins, but I haven't seen a company that didn't end up doing joins with Mongo at some point. Or, they do it in the app layer, in Java/Node/PHP which requires more CPU than it would in the DB.

Also, wanna store an array in a document in Mongo? Every time you add to that array, the whole array is replicated to your secondaries. That eats a lot of CPU.

But you are describing incorrect ways of using NoSQL. NoSQL requires getting the schema right from the get go. Many people don't like it because they are used to relying on SQL (or a language like you mentioned) to smooth things out. NoSQL requires you to ask yourself hard questions - what specific queries are going to consume this data and then model the data as opposed to in SQL where you just store the data and worry about querying it later. This produces efficient queries that ultimately don't need as much CPU
This is completely opposite to the raison d'être of NoSQL databases. They all started out without support for schemas because there was a desire to get away from the schema-upfront and rigidness of SQL databases. They were created to be able to drop in whatever you want as opposed to SQL databases that has strict rules about what goes in.
Differing structures don't necessarily require more compute to query.

Anyway this is a weird argument. Ideally you get your schema right from the get go no matter the DB, of course things will be easier. But we don't. That's why we have migrations. Also, what's right today, might not be after acquiring Jonny Big Corp.

The whole point of NoSQL is to dump documents into it, of varying structure, and query/index them as-needed.

Some claim the opposite, though. Even in this discussion. That NoSQL is needed when you need to change the schema in a production system.

Anyway, most systems that last more than a couple of years tend to change over time, meaning the schema is likely to need extensions.

Sure, changing the schema is easier in NoSQL. But to do it properly, you still need to really understand how the data is being queried and then change the schema in a controlled "migration". For most people these tasks will be a pain. But NoSQL can really shine in terms of $/query if data is being correctly looked after. It will save you from needing 16 cores in multiple clusters and very expensive bills
That's interesting.

A lot of your suggestions are actually exactly the same for SQL based databases. If your schema is not fit for the task at hand, it can slow it down by an order of magnitude, and the process of changing schema is also similar.

Though, i would think, a properly designed SQL database would need such full schema refactoring less often, since adding a few tables within the same structure is easier.

It sound like you're describing use cases with greater data volumes than I usually use SQL for. (Mostly Data Science use cases, where larger datasets typically end up in Spark, or similar)

My experience is that SQL based systems work best up to a few 100 millions of records in the largest tables (a few billion at most), and with transactions per second is less than about 10000. Around those volumes is where SQL start to get really expensive.

And often SQL is used for use cases where number of records per table of less than 10 million and transactions per second in the low hundreds or lower.

But I'm probably biased in the opposite direction that you are. For me, performance usually means efficient joins. Which means that even if I'm leaving RDBMS's behind, I still use SQL where I can (such as in Spark).

This is true, it's also very difficult to do in most business situations (getting the schema right from the get go).
It's hard but it's still possible to change it. It's just very important to do it carefully and knowing what the consequences are. I have had to iterate NoSQL schemas several times, and I always prefer do treat their evolution as migrations done out-of-hours
Which seems to imply a static schema, known more-or-less perfectly at the outset, and unchanging thereafter. That is obviously unreasonable. Application data structures change as new requirements arise, and as existing requirements become better understood.
> Which seems to imply a static schema, known more-or-less perfectly at the outset, and unchanging thereafter. That is obviously unreasonable.

Is it? Most of AWS runs on NoSQL databases, and continues to ship new features that do not fit into existing schema. This assertion then is clearly incorrect.

Most databases are still kinda crap. Either it's hard to scale them or maintain them, hard to know the right way to use them, or they strip out useful functionality for the sake of simplicity, forcing reinventing the wheel. We haven't got much actually novel design since like 2006. There's some people trying to bolt on top of/around/under existing databases for backwards compatibility or to avoid reinventing, but not really much actual novel design for a whole database.

Computer Science research seems to either be fully academic, where research into practical solutions doesn't happen, or it's limited to private companies trying to solve one business problem. Very little actual advancement of the state of the art, or solving of long standing limitations.

Explicit schema is the most important feature to drop.

You cannot have a live service that is interrupted by database modifications.

My solution is to use JSON.

But the most important feature to add is HTTP as transport and async-to-async (both client and server needs to only use a thread when they are doing work, zero idling): To scale a distributed database across continents:

2000 line distributed DB: http://root.rupy.se

> You cannot have a live service that is interrupted by database modifications.

So don't do that, then. Designing database migrations to be non-breaking is part of the game and if you're not doing it, you can't claim to understand the technology you're replacing. Not having an explicit schema doesn't mean you don't have to think about your schema. It just means you've chucked out all the tooling for keeping it sane.

It may be the most important feature to drop, for some.

It's certainly the most important feature to KEEP for others.

Schemas allow you to keep the more critical business relationships consistent. For variable data elements, you can always use JSON columns for that.