Hacker News new | ask | show | jobs
by stubish 12 days ago
The article doesn't mention the biggest problem with serializable isolation. At every commit, you need handle the possibility of a serialization exception and retry the transaction. Traditionally devs and frameworks don't, so your application works fine during development and staging but starts failing under load. It makes commit failures normal, rather than an 'oh shit' problem because your disk has filled or someone has tripped over a network cable.

And how do you retry transactions? Then you hit another issue when using multiple datastores, where you need to learn about two-phase commit and the joys of manually keeping datastores in sync that don't support it (eg. filesystems).

And the locks, if you dare run batch updates along with web requests. The long running transactions lock everything they read, blocking short transactions. Because that is exactly what you asked for. Again, you will miss this during development and only notice under load.

So sure, you might avoid some data consistency issues if your data model and update patterns hit the edge cases. In practice, the reason details about serializable are not well known is the cases are rare. Using it gives you safety (maybe that rare case is your case!), but everything needs to be carefully designed around it.

4 comments

A bigger issue is scalability. Note that this blog post claims YugabyteDB uses serializable as the default isolation level, but if we look at the latest version of their docs it seems they backed off that and at some point the default became snapshot isolation, and then they backed off again and for new databases the default is read committed i.e. the same as Postgres and Oracle:

https://docs.yugabyte.com/stable/explore/transactions/isolat...

So they do allow write skew anomalies.

This happens because serializability requires the database to track every read and intersect them with every write, meaning every node in a distributed cluster must be aware of what every other node is doing. That hurts scalability and performance badly; it's one reason Oracle doesn't support true serializability (the other being that almost no existing DB backed apps can tolerate transaction aborts, as you observe).

There are other issues too. With pessimistic locking you can get deadlocks, but if you deadlock your app the database can detect that and get it moving again by rolling one of the transactions back. It surfaces to the user as a bug that they can easily monitor for and diagnose. With optimistic locking+retry, which serializability basically requires, you can get livelock. Livelock is terrible, the cluster appears to be busy doing useful work but just gets slower and slower. There's not good ways to automatically detect it, at least not in any app framework I've ever encountered.

Disclosure: in the Oracle DB group, opinions are my own and not that of the company. I actually argue with my manager about this topic sometimes, lol.

My experience just differs from you so much. Two of my employers extensively use serializable isolation and the code to handle retries is automatic. It helps that there are other reasons for retrying the transaction, like load shedding. The only bug I’ve observed is when a colleague forgot that the transaction could retry and the code wrote some logs to disk multiple times; it wasn’t even a correctness bug just a performance bug.
I'd say I had worse problems than many due to mixing long and short transactions on the same db. Which is something you need to avoid anyway if possible. The main webapp I dealt with also handled retries automatically, so we could do schema migrations live (with carefully crafted DML and other backend systems shutdown). I think we kept the webapp read committed (PostgreSQL backend), because that system didn't need the consistency guarantees.

Personally, I don't think databases or database libraries should specify a default isolation level at all, and that it needs to be explicitly specified. But legacy code and backwards compatibility and new user experience and all that. I think most of the issues I pointed out come down to needing to be aware of the consequences of your choices, or not even being aware that a choice has been made that can be very hard to change retroactively.

> At every commit, you need handle the possibility of a serialization exception and retry the transaction.

Yeah, but it seems so strange to me. Imagine if a database simply executed all transactions serially. Then there would be no serialization anomalies (though it would be slow, yada yada). So it seems serializable isolation presents a facade of serial execution, but only like, half a facade. You have to deal with the leak in the abstraction yourself and it's surprising to everyone who hears about it for the first time. I wonder why this choice was made.

In SQLite all writes are serialisable by default and it scales really well. I think having a single writer is what makes the big difference here.