by czx4f4bd 1106 days ago
I wonder if there's been any observable correlation between JSON support in the major SQL databases and the decreased (or increased?) adoption of NoSQL document databases like MongoDB. It would be interesting to do some bulk analysis on GitHub commits to compare their use over time.

Just one bit of personal experience, but for me it was a significant reason. In most cases you want objects to have highly structured data (e.g. for joins and queries) and in other cases you just want "a bunch of semi-structured stuff". Sure, DBs always had blobs and text, but JSON is really what you want a lot of the time.

There's also a good article by Martin Fowler about how "NoSQL" was really "NoDBA" for a lot of folks, and I definitely saw that dynamic. JSON fields can also be a good middle ground here, where a DBA can ensure good "structural integrity" of your schema, but you don't need to go through the hassle of adding a new column and schema update if you're just adding some "trivial" bit of data.

The canonical example for me is when you want to store/use additional payment processor details for a transaction, whether it's direct CC, PayPal, Amazon Payments, etc. Relationally, you only really care that the amount of the transaction was sent/received/accepted. But you may want to store the additional details without a series of specific tables per payment processor. If you need to see the extra details, that can still be done at runtime.
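
A rough sketch of that layout, using Python's built-in sqlite3 with the details kept as JSON text (in postgres this would more likely be a jsonb column). The table, column names, and sample payloads are all made up for illustration, not taken from any real processor API:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        amount_cents INTEGER NOT NULL,      -- the part you care about relationally
        status TEXT NOT NULL,
        processor TEXT NOT NULL,
        processor_details TEXT NOT NULL     -- JSON document; shape varies by processor
    )
    """
)


def record_transaction(amount_cents, status, processor, details):
    """Store the structured fields as columns and the rest as one JSON blob."""
    conn.execute(
        "INSERT INTO transactions (amount_cents, status, processor, processor_details) "
        "VALUES (?, ?, ?, ?)",
        (amount_cents, status, processor, json.dumps(details)),
    )


def load_details(tx_id):
    """Decode the processor-specific details at runtime, only when needed."""
    row = conn.execute(
        "SELECT processor_details FROM transactions WHERE id = ?", (tx_id,)
    ).fetchone()
    return json.loads(row[0])


# Hypothetical payloads: each processor gets its own shape, no extra tables.
record_transaction(1999, "accepted", "paypal", {"payer_email": "a@example.com", "fee_cents": 88})
record_transaction(500, "accepted", "card", {"last4": "4242", "network": "visa"})

# Ordinary relational queries never touch the JSON.
total = conn.execute(
    "SELECT SUM(amount_cents) FROM transactions WHERE status = 'accepted'"
).fetchone()[0]
```

Reporting and joins only ever see `amount_cents` and `status`; the JSON is decoded by `load_details` when someone actually needs the processor-specific bits.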

Another good example is generalized classified ads: different categories may have additional details, but you don't necessarily want to create the plethora of tables to store said additional details.

Honestly, I pretty much always want structure. The reasons I've opted for NoSQL are almost always that cloud providers offer it for practically free while managed SQL databases are wayyyy more expensive. The nice thing about JSON is that it's a lot more ergonomic, but not because of the lack of typing--I would absolutely use a database that let me write reasonable type constraints for JSON columns. (I realize that you're talking about why most people use NoSQL and I'm remarking about why I use NoSQL).

Some other controversial thoughts: SQL itself is a really unergonomic query language, and also the lack of any decent Rust-like enum typing is really unfortunate. I know lots of people think that databases aren't for typing, but (1) clearly SQL aspires toward that but gives up halfway and (2) that's a shame because they have a lot of potential in that capacity. Also while you can sort of hack together something like sum types / Rust enums, it's a lot of work to do it reasonably well and even then there are gaps.

Not sure I understand what you mean; or rather, all of this appears to be available in postgres.

pg_jsonschema is a postgres extension that implements schema validation for JSON columns. I'm not particularly familiar with Rust, so not sure exactly what you mean by "Rust-like enum typing", but postgres has enums, composite types, array types, and custom scalars, so not sure what's missing.

By "Rust-like enums", I mean "sum types" or "algebraic data types". In general, it's a way of saying that a piece of data can have one of several different types/shapes (whereas a Postgres enum is basically just a label backed by an int). But yeah, with jsonschema you can probably express sum types, but jsonschema is disappointing for a bunch of reasons and needing an extension is also not great.
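
For anyone who hasn't run into sum types, here is a minimal sketch of the idea in Python rather than Rust. The `CardPayment` / `PayPalPayment` variants and their fields are hypothetical; the point is only that a `Payment` is exactly one of several shapes, each carrying its own data:

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CardPayment:
    last4: str
    network: str


@dataclass(frozen=True)
class PayPalPayment:
    payer_email: str


# The "sum type": a Payment is one variant or the other, never a mix of both.
Payment = Union[CardPayment, PayPalPayment]


def describe(payment: Payment) -> str:
    """Branch on the variant; each branch sees only that variant's fields."""
    if isinstance(payment, CardPayment):
        return f"{payment.network} card ending in {payment.last4}"
    if isinstance(payment, PayPalPayment):
        return f"PayPal account {payment.payer_email}"
    raise TypeError(f"unknown payment variant: {payment!r}")
```

Python only approximates this: nothing at runtime forces `describe` to handle every variant, whereas a Rust `match` on an enum fails to compile if a case is missing. A Postgres enum, by contrast, would only give you the label ("card" vs "paypal") with no per-variant fields attached.
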
Every ecosystem I've ever worked in has had good tooling for managing DB migrations (and in some cases I've been the one to add it). It's trivial to write a migration to ALTER TABLE bar ADD COLUMN foo and I think keeping structure explicit is generally quite beneficial for data safety even if you're not doing fancy things. DBAs are great but most companies simply don't need one - you can just get by with some pretty rudimentary SQL and skill up as needed.

Assuming you've got good integration test coverage of the database, schema alterations end up taking a minuscule amount of time, and if you lack test coverage then please reconsider and add more tests.

Completely disagree. The issue is not really about how hard or easy it is to run migrations (every project I've worked on has also used migration files), it's that, depending on the data, it can just be a total waste of time.

Sibling comment, "is when you want to store/use additional payment processor details for a transaction", is a great example IMO. I've dealt with card processing systems where the card transaction data can be reams of JSON. Now, to be clear, there are a lot of subfields here that are important that I do pull out as columns, but a lot of them are just extra custom metadata specific to the card network. When I'm syncing data from another API, it's awesome that I can just dump the whole JSON blob in a single field, and then pull out the columns that I need. Even more importantly, by sticking the API object blob in a single field, unchanged, it guarantees that I have the full set of data from the API. If I only had individual columns, I'd be losing that audit trail of the API results, and if, for example, the processor added some fields later, I wouldn't be able to store them without updating my DB, too.

Before JSON columns were really standard, I saw lots of cases where people would pull down external APIs into something like mongo, then sync that to a relational DB. Tons of overhead for a worse solution, where instead I can just keep the source JSON blob right next to my structured data in postgres.

I think when you really need/want a DBA is when you're at a point where either you need redundancy/scale or have to remain up. Most developers aren't going to become that familiar with the details of maintenance and scale for any number of different database platforms. I think MS-SQL does better than most at enabling the developer+dba role, but even then there's a lot of relatively specialized knowledge. More so with the likes of Oracle or DB2.
Of course, if you're using Oracle or DB2 you have other/bigger problems.
MongoDB remains the 5th most popular database: https://db-engines.com/en/ranking

And there are four major reasons still to choose MongoDB over something like PostgreSQL.

a) PostgreSQL has terrible support for horizontal scalability. Nothing is built-in, proven or supported.

b) MongoDB has superior ability to manipulate and query the JSON.

c) MongoDB is significantly faster for document-attribute updates.

d) MongoDB has better tooling for those of us that prefer to manage our schema in the application layer.

By the time you need to shard PostgreSQL (billions of records?), you have lots and lots of resources to overcome that difficulty, a la Notion.
If you want to be high-availability then you need sharding or something like it from day 1. There's still no first-class way of running PostgreSQL that doesn't give you at least a noticeable write outage from a single-machine failure.
> If you want to be high-availability then you need sharding or something like it from day 1

Sharding has nothing to do with high-availability.

You horizontally scale for high availability as well as scalability.

And primary-secondary failover in my experience is rarely without issues.

There is a reason almost every new database aims to be distributed from the beginning.

>> There is a reason almost every new database aims to be distributed from the beginning.

That's partly because you can't compete with the existing RDBMSs if you're single node: they are good enough already. Nobody will buy your database if you don't introduce something more novel than PostgreSQL, whether that novelty is worth it or not.

Primary-secondary is simple and robust. If I had a dollar for every time I saw split-brain clusters....

---

And to respond to the sibling comment about "noticeable" downtime....

Primary-secondary failover in <1m is very feasible. And each minute of downtime is a mere 0.002% for the month.
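
The arithmetic behind that figure, as a quick sketch assuming a 30-day month (the function name is just for illustration):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def downtime_percent(minutes_down, minutes_total=MINUTES_PER_MONTH):
    """Share of the month spent down, as a percentage."""
    return 100.0 * minutes_down / minutes_total


one_minute = downtime_percent(1)   # roughly 0.0023%
availability = 100.0 - one_minute  # roughly 99.9977%
print(f"{one_minute:.4f}% down, {availability:.4f}% available")
```

So a single one-minute failover in a month still leaves you above "four nines" (99.99%), which allows about 4.3 minutes of downtime over the same 30 days.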

Primary-secondary isn't what is hurting your availability.

The experience for at least some of us is that failover is not robust. At all. And that < 1m is a best-case scenario that still requires a person to be monitoring the process.

And the fact that the entire industry has moved to a distributed model despite its complexity gives you a hint as to which way the wind has been blowing for the last decade.

You don't need to be that arrogant. The number-one reason why there are no new (No)SQL-Databases for one node is that the existing databases are great and you can't monetize them.

Failover is automatic for PG when using e.g. Patroni. Of course you lose active transactions and that might be a showstopper, but monitoring failover? I am curious when you'll have to do that.

a) not true b) not true c) not true d) not true e) a lot of people have no idea json support exists in PostgreSQL.
Agreed, when you see the index size in Mongo vs PostgreSQL, you will quickly understand that a single PostgreSQL instance can outscale a huge Mongo cluster.
PostgreSQL isn't the only RDBMS to choose from.
You would then have to tell the decreased adoption of NoSQL due to JSON support in major SQL databases apart from the decreased adoption of NoSQL due to the hype being over...