| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by MarkusWinand 1156 days ago

The major news are:

- SQL/PGQ - A Graph Query Language

- JSON improvements (a JSON type, simplified notations)

Peter Eisentraut gives a nice overview here: https://peter.eisentraut.org/blog/2023/04/04/sql-2023-is-fin...

11 comments

dang 1156 days ago

Discussed here:

SQL:2023 is finished: Here is what's new - https://news.ycombinator.com/item?id=35562430 - April 2023 (153 comments)

link

bokchoi 1156 days ago

Thanks! Lots of little neat improvements in there like accessing JSON values using dots and array syntax:

    SELECT t.j.foo.bar[2], ... FROM tbl t ...

link

sverhagen 1156 days ago

I'm a pretty average SQL user, but I've heard expert consultants say before that they could do many more things with SQL databases that developers like me would have maybe grabbed a different tool for, like a graph database. So this then makes me wonder, once there's even broader adoption through PGQ, is that going to be a killer for niche databases like Neo4j, in favor of, say, Postgres?

link

derefr 1156 days ago

Graph databases are about as different from RDBMSes storage-wise, as column-stores are from row-stores. It comes down to how you plan to shard data and distribute queries when data doesn't fit on a single node.

Using a graph DB with many underlying KV-store nodes, you can have a single graph spread over many machines representing e.g. Facebook's social graph, and run a query which "chases around" edges between vertices that live on different nodes, to solve that query, while ensuring that as little of that has to happen as possible — both by rebalancing vertices so that data is sharded at low-connection-degree points in the graph; and by consolidating the steps of queries that occur on the same node into single batch queries, such that the whole thing becomes (close to) a single map/reduce step.

There's nothing in Postgres that knows how to do that; if you had e.g. a graph stored in a Citus hypertable, and did a recursive CTE over it to do graph search, then you'd get pretty dang bad perf.

link

MarkusWinand 1156 days ago

> It comes down to how you plan to shard data and distribute queries when data doesn't fit on a single node.

A problem everbody would love to have but pretty much nobody actually has.

link

jupp0r 1156 days ago

> A problem everbody would love to have

Except the people who do have it and need to keep their business running off of one postgres instance.

link

throwaway894345 1156 days ago

You can have data that fits on one machine and still run multiple instances of postgres in a failover configuration, which will probably cover just about everyone (depending on your filesystem, disk for a single instance is essentially infinite, so I'm not actually sure what bottleneck would motivate you need to scale beyond this configuration).

link

jupp0r 1156 days ago

> I'm not actually sure what bottleneck would motivate you need to scale beyond this configuration

It's usually not that data doesn't fit on one machine but that load on the database exceeds what one machine can serve. A failover configuration might enable you to use the spares for some read operations and take a little load of the primary, but you lose ACID semantics when you do that and it generally doesn't help you for long.

link

threeseed 1156 days ago

There is a reason the world moved on from failover architectures.

a) At some point you will have more data or more users than one instance can handle. And instead of simply adding another node you need to throttle usage in order to do a rolling upgrade. Which is far easier said than done and involves impact to the business.

b) With distributed databases you are constantly testing that everything works in Dev, Test etc environments. With failover you really only test it every now and again usually before you deploy to Production. And in most companies which are hopeless the testing will be guaranteed to be inadequate.

c) Vendors lie. They promise that failover will just work but in my experience it very often doesn't. Which is another reason why b) is such important to validate their claims.

link

justinclift 1156 days ago

Interestingly, when a place does get to the point where the single instance has capacity issues (after upgrading to EPYC and lots of flash drives) then other non-obvious stuff shows up too.

For example, at one place just over a year ago they were well into this territory. One of weird problems for them was with pgBadger's memory usage (https://github.com/darold/pgbadger). That's written in perl, which doesn't seem to do garbage collection well. So even on a reporting node with a few hundred GB's of ram, it could take more than 24 hours to do a "monthly" reporting run to analyse PG usage for the time period.

There wasn't a solution in place at the time I left, so they're probably still having the issue... ;)

link

threeseed 1156 days ago

I consult for a lot of companies and I never heard of or seen a database that wasn't horizontally scaled.

It's not for scalability reasons it's for high-availability.

Which as cloud adoption has increased and server uptime has decreased is even more important.

link

Spooky23 1156 days ago

Some of these arguments and “common knowledge” things are getting old. Everybody scaled up twenty year ago - hell Amazon used to brag that they used an HP Superdome or whatever.

Anyone with dogmatic opinions about this stuff need to be taken with a grain of salt. If you scale out PeopleSoft, your accounting system will exceed the value of your company. If you’re worried about webscaling your random app, that’s more wasting time navel gazing than accomplishing anything! :)

link

jahewson 1156 days ago

Why shard when you can just replicate?

link

lmm 1156 days ago

Because replica failover is rarely seamless (and often doesn't actually work at all, IME).

link

WorldMaker 1156 days ago

The PGQ part of SQL is designed in such a way that an RDBMS can certainly optimize its storage of graph tables (eventually). ("Graph tables" used by PGQ have DDL setup, per the Eisentraut overview: CREATE PROPERTY GRAPH graphname…) If Postgres embraces PGQ it would be easy to imagine they would also adopt some graph database style storage to back its indexing and query planners.

link

zozbot234 1156 days ago

> There's nothing in Postgres that knows how to do that

Postgres uses foreign data wrappers and table partitions to achieve sharding of relational databases over many nodes. It's certainly possible to make the FDW layer smarter wrt. being able to distribute certain queries to the shards, including some recursive queries that happen to be of practical interest.

link

dathinab 1156 days ago

if you are Facebook scale, yes. But then you anyway will use a dedicated graph database.

But a lot of graph databases have sizes where they are fully replicated on each node and some small internal optimizations for the graph are good enough. Given the design of the graph query language (and what you can query, hint: not everything arbitrarily) there is no reason postgres can't gain some additional optimizations to do that effectively as long as it's not facebook scale.

SQL now having syntax for querying graph means Graph databases can support SQL, at least a subset of it, in a standardized way.

link

quantified 1156 days ago

A big problem (IMHO) with graph databases is building "the" graph model, and the fact that it's easy to be faced with problems that don't suit a graph database. Something as simple as returning the distinct set of values for an attribute and count of vertices containing each value require going outside the graph model, so aren't composable very well in a property graph system. (There are other graphs besides property graphs, they will have their time someday.)

What you really want is to apply graph processing to data as it is. The SQL 2023 additions are a step in the right direction. I need to find a good detailed description of the constraints and semantics to assess how good it is.

link

pphysch 1156 days ago

"Kill" is a strong word, as Postgres's solid JSON support technically obsoleted MongoDB for most use cases, but Mongo is still around for various reasons.

I suspect if Postgres had a solid implementation of SQL/PCQ it would be a similar story for Neo4j.

link

czx4f4bd 1156 days ago

I wonder if there's been any observable correlation between JSON support in the major SQL databases and the decreased (or increased?) adoption of NoSQL document databases like MongoDB. It would be interesting to do some bulk analysis on GitHub commits to compare their use over time.

link

hn_throwaway_99 1156 days ago

Just one bit of personal experience, but for me it was a significant reason. In most cases you want objects to have highly structured data (e.g. for joins and queries) and in other cases you just want "a bunch of semi-structured stuff". Sure, DBs always had blobs and text, but JSON is really what you want a lot of the time.

There's also a good article by Martin Fowler about how "NoSQL" was really "NoDBA" for a lot of folks, and I definitely saw that dynamic. JSON fields can also be a good middle ground here, where a DBA can insure good "structural integrity" of your schema, but you don't need to go through the hassle of adding a new column and schema update if you're just adding some "trivial" bit of data.

link

tracker1 1156 days ago

The canonical example for me, is when you want to store/use additional payment processor details for a transaction... If it's direct CC, PayPal, Amazon Payments etc. Relationally you only really care that the amount of the transaction was sent/received/accepted. But you may want to store the additional details, without a series of specific tables per payment processor. If you need to see the extra details that can still be done at runtime.

Another good example is for generalized classified ads, different categories may have additional details, but you don't necessarily want to create the plethora of tables to store said additional details.

link

throwaway894345 1156 days ago

Honestly, I pretty much always want structure. The reasons I've opted for NoSQL are almost always that cloud providers offer it for practically free while managed SQL databases are wayyyy more expensive. The nice thing about JSON is that it's a lot more ergonomic, but not because of the lack of typing--I would absolutely use a database that let my write reasonable type constraints for JSON columns. (I realize that you're talking about why most people use NoSQL and I'm remarking about why I use NoSQL).

Some other controversial thoughts: SQL itself is a really not-ergonomical query language, and also the lack of any decent Rust-like enum typing is really unfortunate. I know lots of people think that databases aren't for typing, but (1) clearly SQL aspires toward that but gives up half way and (2) that's a shame because they have a lot of potential in that capacity. Also while you can sort of hack together something like sum types / Rust enums, it's a lot of work to do it reasonably well and even then there are gaps.

link

hn_throwaway_99 1156 days ago

Not sure I understand what you mean, or rather that all of this appears to be available in postgres.

pg_jsonschema is a postgres extension that implements schema validation for JSON columns. I'm not particularly familiar with Rust, so not sure exactly what you mean by "Rust-like enum typing", but postgres has enums, composite types, array types, and custom scalars, so not sure what's missing.

link

throwaway894345 1155 days ago

By "Rust-like enums", I mean "sum types" or "algebraic data types". In general, it's a way of saying that a piece of data can have one of several different types/shapes (whereas a Postgres enum is basically just a label backed by an int). But yeah, with jsonschema you can probably express sum types, but jsonschema is disappointing for a bunch of reasons and needing an extension is also not great.

link

munk-a 1156 days ago

Every ecosystem I've ever worked in has had good tooling for managing DB migrations (and in some cases I've been the one to add it). It's trivial to write a migration to ALTER TABLE bar ADD COLUMN foo and I think keeping structure explicit is generally quite beneficial for data safety even if you're not doing fancy things. DBAs are great but most companies simply don't need one - you can just get by with some pretty rudimentary SQL and skill up as needed.

Assuming you've got good integration test coverage of the database schema alterations end up taking a minuscule amount of time and if you lack test coverage than please reconsider and add more tests.

link

hn_throwaway_99 1156 days ago

Completely disagree. The issue is not about really about how hard or easy it is to run migrations (every project I've worked on has also used migration files), it's that, depending on the data, it can just be a total waste of time.

Sibling comment, "is when you want to store/use additional payment processor details for a transaction", is a great example IMO. I've dealt with card processing systems where the card transaction data can be reams of JSON. Now, to be clear, there are a lot of subfields here that are important that I do pull out as columns, but a lot of them are just extra custom metadata specific to the card network. When I'm syncing data from another API, it's awesome that I can just dump the whole JSON blob in a single field, and then pull out the columns that I need. Even more importantly, by sticking the API object blob in a single field, unchanged, it guarantees that I have the full set of data from the API. If I only had individual columns, I'd be losing that audit trail of the API results, and if, for example, the processor added some fields later, I wouldn't be able to store them without updating my DB, too.

Before JSON columns were really standard, saw lots of cases where people would pull down external APIs into something like mongo, then sync that to a relational DB. Tons of overhead for a worse solution where instead I can just keep the source JSON blob right next to my structured data in postgres.

link

tracker1 1156 days ago

I think when you really need/want a DBA is when you're at a point where either you need redundancy/scale or have to remain up. Most developers aren't going to become that familiar with the details of maintenance and scale for any number of different database platforms. I think MS-SQL does better than most at enabling the developer+dba role, but even then there's a lot of relatively specialized knowledge. More so with the likes of Oracle or DB2.

link

psidebot 1156 days ago

Of course, if you're using Oracle or DB2 you have other/bigger problems.

link

threeseed 1156 days ago

MongoDB remains the 5th most popular database: https://db-engines.com/en/ranking

And there are four major reasons still to choose MongoDB over something like PostgreSQL.

a) PostgreSQL has terrible support for horizontal scalability. Nothing is built-in, proven or supported.

b) MongoDB has superior ability to manipulate and query the JSON.

c) MongoDB is significantly faster for document-attribute updates.

d) MongoDB has better tooling for those of us that prefer to manage our schema in the application layer.

link

paulddraper 1156 days ago

By the time you need to shard PostgreSQL (billions of records?), you have lots and lots of resources to overcome that difficulty, a la Notion.

link

lmm 1156 days ago

If you want to be high-availability then you need sharding or something like it from day 1. There's still no first-class way of running PostgreSQL that doesn't give you at least a noticeable write outage from a single-machine failure.

link

hans_castorp 1156 days ago

> If you want to be high-availability then you need sharding or something like it from day 1

Sharding has nothing to do with high-availability.

link

threeseed 1156 days ago

You horizontally scale for high availability as well as scalability.

And primary-secondary failover in my experience is rarely without issues.

There is a reason almost every new database aims to be distributed from the beginning.

link

dikei 1156 days ago

>> There is a reason almost every new database aims to be distributed from the beginning.

That's partly because you can't compete with the existing RDBMSs if you're single node: they are good enough already. Nobody will buy your database if you don't introduce something more novel than PostgreSQL, whether that novelty is worth it or not.

link

paulddraper 1156 days ago

Primary-secondary is simple and robust. If I had a dollar for every time I saw split-brain clusters....

---

And to respond sibling comment about "noticeable" downtime....

Primary-secondary failover in <1m is very feasible. And each minute downtime is a mere 0.002% for the month.

Primary-secondary isn't what is hurting your availability.

link

throwaway2990 1156 days ago

a) not true b) not true c) not true d) not true e) a lot of people have no idea json support exists in PostgreSQL.

link

olavgg 1155 days ago

Agreed, when you see the index size in Mongo vs PostgreSQL, you will quickly understand that a single PostgreSQL instance can outscale a huge Mongo cluster.

link

pjmlp 1156 days ago

PostgreSQL isn't the only RDMS to chose from.

link

sverhagen 1156 days ago

You would have to tell the decreased adoption of NoSQL due to JSON support in major SQL databases apart then from the decreased adoption of NoSQL due to the hype being over...

link

pphysch 1156 days ago

See also [1] for how this (might) relate to PostgreSQL

In particular it is nice to see that a core dev views JSON dot accessing and PCQ as "sensible" future additions to Postgres.

[1] - https://peter.eisentraut.org/blog/2023/04/18/postgresql-and-...

link

wslh 1156 days ago

Basic question: is it correct to assume that having PGQ involves a big change in the database engine?

link

pphysch 1156 days ago

AFAICT the idea is that you are not directly querying the tables as a graph, but you construct a graph "view" from the tables, and then query that graph using PCQ.

link

zubiaur 1156 days ago

Correct. The one RD I know that has implemented this "graph view" concept is Oracle's. They did it first with PGQL extensions on top of say 19c. Interesting, but the execution seemed a bit unpolished.

Now with 23c, they are adopting PGQ and one should be able to interface through the regular JDBC connectors. Will see how it shapes up.

link

zozbot234 1156 days ago

It's just a different language and a simple "property" layer over the existing data. No changes to the internals are necessary.

link

WorldMaker 1156 days ago

No changes are required by PGQ but some engines may wind up using PGQ as a signal to bring in new types of indexes or disk storage formats to back it.

It may be interesting to see if real world usage of PGQ pushes the database engines to do interesting things.

link

thanatos519 1156 days ago

PGQ looks neat - create a "property graph" from a relational model, then query it via Cypher-like expressions. The best or the worst of both worlds, depending on implementation quality.

link

justinclift 1156 days ago

As a thought, it might be better to use the https:// link to Peter's overview. :)

link

MarkusWinand 1156 days ago

Fixed. I wonder why Google sent me to http...

link

mariuz 1156 days ago

And here is the article on the status of SQL:2023 support in PostgreSQL https://peter.eisentraut.org/blog/2023/04/18/postgresql-and-...

link

Zpalmtree 1156 days ago

I like the DISTINCT / NOT DISTINCT unique NULL option, I was wanting this feature just a few weeks ago

link

MarkusWinand 1156 days ago

That particular one is already available in PostgreSQL 15.

https://modern-sql.com/caniuse/unique-nulls-not-distinct

link

bionhoward 1156 days ago

Which SQL DBs support these features now? Who is almost there? I’m definitely excited to try it!

link

ksec 1156 days ago

I wonder when or even if MySQL will adopt any of these.

link