I'm a pretty average SQL user, but I've heard expert consultants say before that they could do many more things with SQL databases that developers like me would have maybe grabbed a different tool for, like a graph database. So this then makes me wonder, once there's even broader adoption through PGQ, is that going to be a killer for niche databases like Neo4j, in favor of, say, Postgres?
Graph databases are about as different from RDBMSes storage-wise, as column-stores are from row-stores. It comes down to how you plan to shard data and distribute queries when data doesn't fit on a single node.
Using a graph DB with many underlying KV-store nodes, you can have a single graph spread over many machines representing e.g. Facebook's social graph, and run a query which "chases around" edges between vertices that live on different nodes, to solve that query, while ensuring that as little of that has to happen as possible — both by rebalancing vertices so that data is sharded at low-connection-degree points in the graph; and by consolidating the steps of queries that occur on the same node into single batch queries, such that the whole thing becomes (close to) a single map/reduce step.
There's nothing in Postgres that knows how to do that; if you had e.g. a graph stored in a Citus hypertable, and did a recursive CTE over it to do graph search, then you'd get pretty dang bad perf.
You can have data that fits on one machine and still run multiple instances of postgres in a failover configuration, which will probably cover just about everyone (depending on your filesystem, disk for a single instance is essentially infinite, so I'm not actually sure what bottleneck would motivate you need to scale beyond this configuration).
> I'm not actually sure what bottleneck would motivate you need to scale beyond this configuration
It's usually not that data doesn't fit on one machine but that load on the database exceeds what one machine can serve. A failover configuration might enable you to use the spares for some read operations and take a little load of the primary, but you lose ACID semantics when you do that and it generally doesn't help you for long.
There is a reason the world moved on from failover architectures.
a) At some point you will have more data or more users than one instance can handle. And instead of simply adding another node you need to throttle usage in order to do a rolling upgrade. Which is far easier said than done and involves impact to the business.
b) With distributed databases you are constantly testing that everything works in Dev, Test etc environments. With failover you really only test it every now and again usually before you deploy to Production. And in most companies which are hopeless the testing will be guaranteed to be inadequate.
c) Vendors lie. They promise that failover will just work but in my experience it very often doesn't. Which is another reason why b) is such important to validate their claims.
Interestingly, when a place does get to the point where the single instance has capacity issues (after upgrading to EPYC and lots of flash drives) then other non-obvious stuff shows up too.
For example, at one place just over a year ago they were well into this territory. One of weird problems for them was with pgBadger's memory usage (https://github.com/darold/pgbadger). That's written in perl, which doesn't seem to do garbage collection well. So even on a reporting node with a few hundred GB's of ram, it could take more than 24 hours to do a "monthly" reporting run to analyse PG usage for the time period.
There wasn't a solution in place at the time I left, so they're probably still having the issue... ;)
Some of these arguments and “common knowledge” things are getting old. Everybody scaled up twenty year ago - hell Amazon used to brag that they used an HP Superdome or whatever.
Anyone with dogmatic opinions about this stuff need to be taken with a grain of salt. If you scale out PeopleSoft, your accounting system will exceed the value of your company. If you’re worried about webscaling your random app, that’s more wasting time navel gazing than accomplishing anything! :)
The PGQ part of SQL is designed in such a way that an RDBMS can certainly optimize its storage of graph tables (eventually). ("Graph tables" used by PGQ have DDL setup, per the Eisentraut overview: CREATE PROPERTY GRAPH graphname…) If Postgres embraces PGQ it would be easy to imagine they would also adopt some graph database style storage to back its indexing and query planners.
> There's nothing in Postgres that knows how to do that
Postgres uses foreign data wrappers and table partitions to achieve sharding of relational databases over many nodes. It's certainly possible to make the FDW layer smarter wrt. being able to distribute certain queries to the shards, including some recursive queries that happen to be of practical interest.
if you are Facebook scale, yes. But then you anyway will use a dedicated graph database.
But a lot of graph databases have sizes where they are fully replicated on each node and some small internal optimizations for the graph are good enough. Given the design of the graph query language (and what you can query, hint: not everything arbitrarily) there is no reason postgres can't gain some additional optimizations to do that effectively as long as it's not facebook scale.
SQL now having syntax for querying graph means Graph databases can support SQL, at least a subset of it, in a standardized way.
A big problem (IMHO) with graph databases is building "the" graph model, and the fact that it's easy to be faced with problems that don't suit a graph database. Something as simple as returning the distinct set of values for an attribute and count of vertices containing each value require going outside the graph model, so aren't composable very well in a property graph system. (There are other graphs besides property graphs, they will have their time someday.)
What you really want is to apply graph processing to data as it is. The SQL 2023 additions are a step in the right direction. I need to find a good detailed description of the constraints and semantics to assess how good it is.
"Kill" is a strong word, as Postgres's solid JSON support technically obsoleted MongoDB for most use cases, but Mongo is still around for various reasons.
I suspect if Postgres had a solid implementation of SQL/PCQ it would be a similar story for Neo4j.
I wonder if there's been any observable correlation between JSON support in the major SQL databases and the decreased (or increased?) adoption of NoSQL document databases like MongoDB. It would be interesting to do some bulk analysis on GitHub commits to compare their use over time.
Just one bit of personal experience, but for me it was a significant reason. In most cases you want objects to have highly structured data (e.g. for joins and queries) and in other cases you just want "a bunch of semi-structured stuff". Sure, DBs always had blobs and text, but JSON is really what you want a lot of the time.
There's also a good article by Martin Fowler about how "NoSQL" was really "NoDBA" for a lot of folks, and I definitely saw that dynamic. JSON fields can also be a good middle ground here, where a DBA can insure good "structural integrity" of your schema, but you don't need to go through the hassle of adding a new column and schema update if you're just adding some "trivial" bit of data.
The canonical example for me, is when you want to store/use additional payment processor details for a transaction... If it's direct CC, PayPal, Amazon Payments etc. Relationally you only really care that the amount of the transaction was sent/received/accepted. But you may want to store the additional details, without a series of specific tables per payment processor. If you need to see the extra details that can still be done at runtime.
Another good example is for generalized classified ads, different categories may have additional details, but you don't necessarily want to create the plethora of tables to store said additional details.
Honestly, I pretty much always want structure. The reasons I've opted for NoSQL are almost always that cloud providers offer it for practically free while managed SQL databases are wayyyy more expensive. The nice thing about JSON is that it's a lot more ergonomic, but not because of the lack of typing--I would absolutely use a database that let my write reasonable type constraints for JSON columns. (I realize that you're talking about why most people use NoSQL and I'm remarking about why I use NoSQL).
Some other controversial thoughts: SQL itself is a really not-ergonomical query language, and also the lack of any decent Rust-like enum typing is really unfortunate. I know lots of people think that databases aren't for typing, but (1) clearly SQL aspires toward that but gives up half way and (2) that's a shame because they have a lot of potential in that capacity. Also while you can sort of hack together something like sum types / Rust enums, it's a lot of work to do it reasonably well and even then there are gaps.
Not sure I understand what you mean, or rather that all of this appears to be available in postgres.
pg_jsonschema is a postgres extension that implements schema validation for JSON columns. I'm not particularly familiar with Rust, so not sure exactly what you mean by "Rust-like enum typing", but postgres has enums, composite types, array types, and custom scalars, so not sure what's missing.
By "Rust-like enums", I mean "sum types" or "algebraic data types". In general, it's a way of saying that a piece of data can have one of several different types/shapes (whereas a Postgres enum is basically just a label backed by an int). But yeah, with jsonschema you can probably express sum types, but jsonschema is disappointing for a bunch of reasons and needing an extension is also not great.
Every ecosystem I've ever worked in has had good tooling for managing DB migrations (and in some cases I've been the one to add it). It's trivial to write a migration to ALTER TABLE bar ADD COLUMN foo and I think keeping structure explicit is generally quite beneficial for data safety even if you're not doing fancy things. DBAs are great but most companies simply don't need one - you can just get by with some pretty rudimentary SQL and skill up as needed.
Assuming you've got good integration test coverage of the database schema alterations end up taking a minuscule amount of time and if you lack test coverage than please reconsider and add more tests.
Completely disagree. The issue is not about really about how hard or easy it is to run migrations (every project I've worked on has also used migration files), it's that, depending on the data, it can just be a total waste of time.
Sibling comment, "is when you want to store/use additional payment processor details for a transaction", is a great example IMO. I've dealt with card processing systems where the card transaction data can be reams of JSON. Now, to be clear, there are a lot of subfields here that are important that I do pull out as columns, but a lot of them are just extra custom metadata specific to the card network. When I'm syncing data from another API, it's awesome that I can just dump the whole JSON blob in a single field, and then pull out the columns that I need. Even more importantly, by sticking the API object blob in a single field, unchanged, it guarantees that I have the full set of data from the API. If I only had individual columns, I'd be losing that audit trail of the API results, and if, for example, the processor added some fields later, I wouldn't be able to store them without updating my DB, too.
Before JSON columns were really standard, saw lots of cases where people would pull down external APIs into something like mongo, then sync that to a relational DB. Tons of overhead for a worse solution where instead I can just keep the source JSON blob right next to my structured data in postgres.
I think when you really need/want a DBA is when you're at a point where either you need redundancy/scale or have to remain up. Most developers aren't going to become that familiar with the details of maintenance and scale for any number of different database platforms. I think MS-SQL does better than most at enabling the developer+dba role, but even then there's a lot of relatively specialized knowledge. More so with the likes of Oracle or DB2.
If you want to be high-availability then you need sharding or something like it from day 1. There's still no first-class way of running PostgreSQL that doesn't give you at least a noticeable write outage from a single-machine failure.
>> There is a reason almost every new database aims to be distributed from the beginning.
That's partly because you can't compete with the existing RDBMSs if you're single node: they are good enough already. Nobody will buy your database if you don't introduce something more novel than PostgreSQL, whether that novelty is worth it or not.
Agreed, when you see the index size in Mongo vs PostgreSQL, you will quickly understand that a single PostgreSQL instance can outscale a huge Mongo cluster.
You would have to tell the decreased adoption of NoSQL due to JSON support in major SQL databases apart then from the decreased adoption of NoSQL due to the hype being over...
AFAICT the idea is that you are not directly querying the tables as a graph, but you construct a graph "view" from the tables, and then query that graph using PCQ.
Correct. The one RD I know that has implemented this "graph view" concept is Oracle's. They did it first with PGQL extensions on top of say 19c. Interesting, but the execution seemed a bit unpolished.
Now with 23c, they are adopting PGQ and one should be able to interface through the regular JDBC connectors. Will see how it shapes up.
PGQ looks neat - create a "property graph" from a relational model, then query it via Cypher-like expressions. The best or the worst of both worlds, depending on implementation quality.
SQL:2023 is finished: Here is what's new - https://news.ycombinator.com/item?id=35562430 - April 2023 (153 comments)