Hacker News new | ask | show | jobs
by ComodoHacker 2232 days ago
Among modern successful RDBMSes employing MVCC only PosgreSQL (correct me if I'm wrong) stores stale row versions on the page. Other vendors store them in some physically separate place (tempdb in MSSQL, UNDO tablespace in Oracle, etc.) This reduces both regular read (scans) and write (page splits) I/O and eliminates the need to vacuum (more I/O).

Of course this approach requires some maintenance of that separate place. But such maintenance doesn't compete for I/O with regular workload and doesn't hurt performance.

So after decades of progress and improvement it looks like this latter approach turned out to be more efficient and is winning historically.

6 comments

Actually MSSQL still has a vacuum system under the hood ("ghost record cleanup task") that is needed even if you have disabled all MVCC features. It is not as critical as in PostgreSQL, since there will be fewer rows to clean up.

It runs automatically in the background, with rather conservative settings that usually prevent it from having much perf impact.

The system also ties into MVCC because deleted rows in the main table cannot be removed while any query can still access some other version of that row. The same is true when changing any key column in an index. In those cases, the update is a delete and insert within that index. And the deleted row must stick around if there is any transaction that could read it.

So while SQL Server does not get as many rows that needs to be vacuumed as PostgreSQL does, just storing the versions separately certainly does not completely eliminate the need for after the fact cleanup.

This is exactly what MySQL (and I think Oracle) does. The final delete of a row remains tombstoned in the main table until a background process determines it is safe to remove based on transaction visibility (MySQL calls this purge).

It seems the important difference is whether the table needs to be scanned to find dead rows. In Postgres it seems the answer is yes. In MySQL it keeps this info as a separate scratchpad to quickly find the rows to remove once they are allowed to be.

Both approaches have advantages and disadvantages. For example, in MySQL, if you have a very long running transaction such that it can't purge rows for a long time, you can actually overflow the rollback segment onto disk and purging slows down dramatically. You might even need to pause writes to get it to recover, although I haven't ever seen it get that bad. With a table scan approach it degrades more gracefully in that scenario.

One annoying wrinkle in MySQL's purge implementation: it won't progress to any old row versions that are newer than the oldest transaction, even if there are no transactions that could ever possibly access those rows. [1]

As you noted, this can get painful since purge effectively gets blocked -- if you have a high-volume OLTP workload with heavy emphasis on quick small UPDATEs / DELETEs, and then introduce a single very long-running read transaction (such as a logical dump/backup), purge can't make progress. But this is just due to the current implementation. In theory a smarter system could differentiate between row versions actually needed by old transactions vs ones that are inaccessible to all transactions.

[1] https://bugs.mysql.com/bug.php?id=74919

You can read how different databases provide MVCC at my other blog: https://amitkapila16.blogspot.com/2015/03/different-approach...
As far as I understand, the trade-off of storing stale rows in a separate place is that you need to bring them back if the transaction rolls back, making rollbacks more expensive.

Also, you need to copy entire row before updating it or deleting it, not sure how significant this cost is in practice.

Yes, but expansive rollbacks are acceptable as they are very rare in production.
This is somewhat confused/mistaken.

There are several successful databases that use in page MVCC storage. I'd suggest going through Andy Pavlo's slide decks for a summary: https://15721.courses.cs.cmu.edu/spring2020/, in particular slide 33 from lecture 3 has a summary. Also there are papers wheree his group directly benchmarks the various design alternatives within their test database, Peloton.

The TL;DR, is there's a lot of details to how this stuff works, many tradeoffs involved, and it's by no means whatsoever clear that Postgres has somehow made the wrong choice.

Thank you for the link, I'm really interested in this topic. However I can't figure out how to access those slides. Do you have to be enrolled?
Yeah, he does two big database classes, and the last couple years of each are online with slides, youtube of the lectures, etc. Just click on the schedule for each class. They're amazing resources for learning the state of the art.
Today I just woke up to find yet another "ORA-01555: snapshot too old" on something I left running overnight.

Honestly, performance isn't as important as the capacity of getting an atomic snapshot of your data with any reliability.

Avoiding ORA-01555 is a matter of properly sizing your undo.
And a matter of not getting any task that is slower than what you expected.