Hacker News new | ask | show | jobs
by trhway 2949 days ago
>If a set of transactions must maintain some kind of invariant within the database (for instance, the sum of some set of fields is always greater than zero). In a database that guarantees serializability, it’s sufficient to verify that every individual transaction maintains this invariant on its own. With anything less than serializability, including Snapshot, one must consider the interactions between every transaction to ensure said invariants are upheld.

this is a difference between practice and theory. The invariant mentioned above is of theoretical interest. On practice such invariant would look like "sum of some fields is always equal to the value in that field of that row". (left as an exercise to the reader to see that that former theoretical invariant, and actually any other, can, without loss of generality, always be implemented as the practical invariant with the target field :) As basic Read Committed still means serialization of write access to that "sum" target field, it is enough to verify only that each transaction writing to any of those fields is also writing to that "sum" field and that it does uphold the invariant. There is no need to pay the huge performance price of serializing those transactions with all the rest of transactions in the system. This is why Read Committed is veriafiably enough for correct execution of so many applications (i.e. for any application whose invariants are implemented in that "practical" form). It is basically like multi-thread programming with correctly implemented synchronization of access to shared data.

2 comments

> it is enough to verify only that each transaction writing to any of those fields is also writing to that "sum" field and that it does uphold the invariant

This is a good try at a theory of when you can use weak isolation. But as stated, I believe it is false. If the invariant is A+B=C, I think this sequence is permitted by READ COMMITTED:

T1: read A=1

T2: read B=1

T2: write A=2

T2: write C=3

T2: commit

T1: write B=10

T1: write C=11

T1: commit

Now A=2, B=10, C=11

More to the point, invariants aren't always local to a row, and people don't usually even fully articulate them.

> There is no need to pay the huge performance price of serializing those transactions with all the rest of transactions in the system

Serializable execution can be made very very fast when there aren't actually a lot of conflicts. And then when there are you can do the hard work of trying to prove that something weaker will work. The failure mode in the other direction is silent corruption of your data.

What i was talking about in its most simple (though not most performant/scalable as it would be more complicated, so we wouldn't go into it now) implementation would look like this - the transactions involved with an invariant should start with locking of the "target" field of the invariant, the field C in your example:

T1: lock C

T2: attempt to lock C - blocked until T1 completes.

I.e. we have these 2 transactions serialized between themselves without unnecessary extra serialization with all the rest of the transactions in the system. The "target" field just plays the role of the mutex associated with the invariant (and this is why any invariant can be converted to that form just by having a mutex, ie. a "target" field, associated with it). From the theory of multi-threaded programming with mutexes we just know that everything can be implemented correctly in such a model. I.e. any application can be made correctly with READ COMMITTED isolation if required set of "mutexes" and their acquire/release is correctly implemented.

>Serializable execution can be made very very fast when there aren't actually a lot of conflicts.

not exactly. Serializability by itself is pretty expensive global invariant and your application would be paying for it in performance and scalability even though it may not really need it.

> any application can be made correctly with READ COMMITTED isolation if required set of "mutexes" and their acquire/release is correctly implemented

This is true, but it's "a big if." And the result might be faster than a good implementation of serializable isolation (since the latter is working to maintain some invariants that you don't actually care about), but it could easily be much slower: very broad invariants will require very broad locks which cause much more "serialization" than is required for serializability.

Let's say you want the same sum invariant but across rows: all users' balances must sum to zero. The straightforward application of your reduction is to essentially have a global mutex for all balance transactions, but this is nowhere near the most scalable solution. Whereas a good serializable database will handle this situation scalably, safely, and automatically.

In my experience, many useful invariants cannot be trivially expressed in a singular "target field" against which one may lock, but that's actually beside the point.

The point is that--insofar as it comes to correctness--serializability allows local reasoning, whereas weaker isolation levels require global reasoning. You can't argue against that by saying "well, if every transaction just explicitly locks the appropriate records at the appropriate times …" because that's exactly the type of global reasoning that we're trying to avoid.

There certainly is a difference between theory and practice here, but I don't think it's in favor of weaker isolation levels.

In theory, as you note, one can enforce any invariant–even in Read Committed–by taking a global lock in every transaction. In practice, nobody does that, because it takes concurrency to zero.

In theory, one can regain concurrency by judiciously breaking that global lock into some application-specific hierarchy of locks, and by having the discipline to adhere to that bespoke locking protocol for every transaction, now and in perpetuity. In practice, that is both too expensive and too error prone to be applied successfully in any non-trivial system (though some do still try).

In theory, these concerns can be addressed at the application level. In practice, it is grossly irresponsible that we (as an industry) continue to foist these concerns on application developers who–in the aggregate–cannot reasonably be expected to get it right. The number of viable database systems is small and growing slowly, while the number of database-backed applications is large and growing rapidly. To have any hope of improving the status quo, well: I know where I'd put my money.

>In my experience, many useful invariants cannot be trivially expressed in a singular "target field" against which one may lock

just trivially dedicate/associate a field with each invariant :)

>The point is that--insofar as it comes to correctness--serializability allows local reasoning, whereas weaker isolation levels require global reasoning. You can't argue against that by saying "well, if every transaction just explicitly locks the appropriate records at the appropriate times …" because that's exactly the type of global reasoning that we're trying to avoid.

The reasoning required is the same in both cases. To avoid serialization conflicts you still have to identify all those appropriate records and appropriate times in your reasoning. So the choice is either to explicitly implement required locking for RC or let the SERIALIZABLE do it implicitly under the hood (and hoping that it will do it better :).

>one can enforce any invariant–even in Read Committed–by taking a global lock in every transaction. In practice, nobody does that, because it takes concurrency to zero.

in naive (on practice of course it is more complex and better as result) implementation of what i said the worst case concurrency is the number of disjoint set of invariants. And SERIALIZABLE would do about the same concurrency or worse.

>In practice, that is both too expensive and too error prone to be applied successfully in any non-trivial system (though some do still try).

in my experience any non-trivial system wouldn't run in SERIALIZABLE without hitting the conflicts until the system gets subjected a lot to that "global reasoning" that you mentioned. And flushing all those cases is very time and effort consuming because it is "global" reasoning about implicit locking and relations. The explicit lock based model for RC is more simpler mode of thinking and thus much more practicable and trackable. And this is why we all run RC. At least it runs and if some reasoning was put into it - it runs correctly, fast and scalable.

The reasoning required is the same in both cases. To avoid serialization conflicts you still have to identify all those appropriate records and appropriate times in your reasoning.

Well, sort of. Serializability allows you to reason locally for correctness, but not for performance. That is, however, a meaningful distinction.

Serializability remains a marked improvement over the status quo. Most operations aren't actually performance sensitive, so that tradeoff makes perfect sense. With serializability, we need only focus our efforts on those operations which are both performance sensitive and highly contended.

This has nice emergent properties, because the problematic transactions are immediately apparent: they are the slow ones, the ones that are not meeting their performance budget.

Performance anomalies are directly measurable and immediately visible. Correctness anomalies are generally neither.

Performance anomalies are generally localized and ephemeral. Correctness anomalies are often persistent and viral.

So the choice is either to explicitly implement required locking for RC or let the SERIALIZABLE do it implicitly under the hood (and hoping that it will do it better :).

Again, when it comes to correctness, there's no real hope to do better than serializable. When it comes to performance: sure. But in every application domain I've worked, getting to a wrong result faster is rarely useful. (I understand that this is not a universal truth, but I believe it is significantly more common than the alternative. The adage "first make it work, then make it fast" comes to mind.)

in my experience any non-trivial system wouldn't run in SERIALIZABLE without hitting the conflicts until the system gets subjected a lot to that "global reasoning" that you mentioned.

I agree that there have been (and still are) practical impediments to deploying large-scale serializable systems with incumbent technologies. But I disagree that those impediments are inherent to all implementations of the serializable isolation level, which is why these kinds of articles (and the R&D behind them) are so important.

We still have a long way to go before serializable-by-default is a tenable option for the most demanding of systems, but that is what we should be working towards.