Hacker News new | ask | show | jobs
by petergeoghegan 999 days ago
> insert...on conflict is a fine way to do upserts on older PostgreSQL versions. For modern ones (15+) there is a better alternative [SQL Standard: MERGE]

This is incorrect. To quote the Postgres MERGE docs:

"When MERGE is run concurrently with other commands that modify the target table, the usual transaction isolation rules apply; see Section 13.2 for an explanation on the behavior at each isolation level. You may also wish to consider using INSERT ... ON CONFLICT as an alternative statement which offers the ability to run an UPDATE if a concurrent INSERT occurs. There are a variety of differences and restrictions between the two statement types and they are not interchangeable."

MERGE can give duplicate violation errors when concurrent inserts take place (at the default READ COMMITTED isolation level). MERGE is no more capable of avoiding these sorts of race conditions than totally naively application code. In short, upserting is precisely what you're not supposed to use MERGE for -- ON CONFLICT is expressly designed for that use case.

3 comments

I agree. In SQL Server, merge comes with a number of quirks that cause me more pain when I try to use it. The ONLY reason I use MERGE today, isn't to merge, but to include non-inserted columns in the OUTPUT clause; only the merge statement can do that. This can be useful when you want to output a lookup from a source to a newly inserted identity (auto-increment) column.

Generally, stay away from MERGE. At least in SQL Server.

I've done a lot of work on MERGE queries recently. For postgres I suspect that self updating CTEs that report back what actually happened are maybe a better idea.
MERGE is standardized in ANSI SQL 2003, and updated in 2008:

https://en.wikipedia.org/wiki/Merge_(SQL)

No question, everyone should use standardized, cross-platform syntax if at all possible.

Maybe someday ANSI will standardize some kind of upsert syntax; until that time use upsert only if you are explicitly tying your SQL to your particular database.

> No question, everyone should use standardized, cross-platform syntax if at all possible.

There are big questionmarks around this statement. SQL isn't an interchange format. Standard table design goes a long way because data has lots of different consumers and needs frequently to be moved around between systems, standard SQL is nothing useful to aim for. Everyone already uses DB-specific drivers and syntax because it is better than the alternative.

- The standard is not generally available. Most of us will never learn what is in it.

- SQL standardises terrible syntax. The SQL standards committee has a unique view on programming where "function_name(a, b, c);" is some sort of weird alternative syntax. They haven't gotten as far as working out variables or function composition either [0].

- Tying SQL to your specific database is the best option for performance. Writing database-independent SQL is somethign of a fools errand, it is either trivial performance-insensitive enough that you should be using a code generator or complex enough to deserve DB-specific turning.

[0] I assume. Not going to pay for a copy of the standard to check.

> The standard is not generally available. Most of us will never learn what is in it.

It also not aimed at users, but at implementors. Funny enough, they don't read the standard either ;) But more seriously: Some implementations are old and generally vendors prefer not changing the behavior of their product. When the standard comes later, the train has already departed. The most critical incompatibilities are in the elder features. The newer ones have a tendency the be more aligned with standard behavior (e.g. window functions are typically implemented just fine).

> They haven't gotten as far as working out variables or function composition either [0].

Part 2 SQL[0] is declarative and intentionally doesn't have variables. Part 4 SQL (aka "pl" SQL) does have variables. I personally consider Part 4 obsolete on systems that have sufficient modern Part 2 support.

> Tying SQL to your specific database is the best option for performance. Writing database-independent SQL is somethign of a fools errand, it is either trivial performance-insensitive enough that you should be using a code generator or complex enough to deserve DB-specific turning.

While this is certainly true for some cases there are also plenty of examples where the standard SQL is more concise and less error prone than product-specific alternatives. E.g. there COALESCE is the way to go rather than isnull, ifnull, nvl, or the like (typically limited to two arguments, sometimes strange treatment of different types).

There is a lot of *unnecessary* vendor-lock in in the field.

[0] https://modern-sql.com/standard/parts

> Part 2 SQL[0] is declarative and intentionally doesn't have variables...

The issue I have with that is that making a bad decision for a reason doesn't change the fundamental correctness of the decision.

SQL semantics clearly support variables, because they are available as WITH ... clauses and subqueries. So either these elements are a mistake and should be discouraged, or the language has variables and they should be made to read like variables in almost all other languages for consistency and readability.

This is literally a cosmetic issue but a lot of thought on syntax has happened since the 1970s and there seems to be an overwhelming on-the-ground consensus that SQL's syntax is not the way to lay code out. The semantics? Great. But there is nearly nothing from SQL's syntax choices that is present in any surviving programming language since. Many of the more popular languages explicitly reject SQL syntax decisions like Python (significant whitespace), "function(arg,arg,arg)" instead of constructs like "VERB arg,arg WITH TEXT,arg TEXT arg" (everything I've met bar lisp) for functions and not replicating broken English (everything).

Variables are a different thing because the declarative nature of SQL is a useful property. But standard SQL overall does such an appalling job of handling syntax I feel confident that they have gotten this part of the syntax wrong too.

Gotten a bit off topic, just happens to be a thing I feel strongly about.

Lisp is in no way closer to SQL syntax than languages like python. On the contrary, lisp syntax is much more regular.
Sigh, I never wanted to read the SQL standard before, but now that I know it’s behind a pay wall I want to read it.
I don’t believe there is a SQL standard.
What most users want is a statement that allows them to UPSERT, with reasonable guarantees around that never throwing a duplicate violation error, and never deadlocking (at least not in any novel way caused by implementation details). Something practical.

That's not the kind of thing that the standard takes a position on, for a variety of reasons. In general, it says very little about concurrency control/isolation. And what little is does say is all very abstract.

For whatever reason, a lot of people believe that MERGE promises something over and above what using multiple INSERT, UPDATE, and DELETE statements will promise. But that simply isn't true. The standard doesn't say anything that even vaguely suggests otherwise. Same is true of both the SQL Server MERGE docs, and the Oracle MERGE docs.

Before you recommend anything, read this about MSSQL's dangerously borked MERGE https://www.mssqltips.com/sqlservertip/3074/use-caution-with...

Anyone using MSSQL should read it.