Hacker News new | ask | show | jobs
by malisper 2079 days ago
For those unaware, Zheap is a new storage engine for Postgres that handles updates and deletes in a different way. Currently, when you update a row in Postgres, Postgres creates a new copy of the row and marks the old one as deleted. This is done for several reasons, specifically it makes handling concurrency easier and it makes rolling back transactions easier.

The issue with this approach is over time this leads to lots of "dead rows" - deleted rows that are taking up space in your DB. Postgres has a background job called the "vacuum" which effectively garbage collects all deleted rows. Depending on your settings, the vacuum can be a pretty expensive job and even with the vacuum, you can still wind up using a lot more space than you actually need.

Zheap addresses these problems by using a different approach. When performing an update, instead of marking the old row as deleted and inserting a new one, Postgres will replace the old row with the new one and write a copy of the old row to a separate file. This means the main table file doesn't need to be larger than necessary to keep track of the dead rows.

Zheap does lead to lots of tricky scenarios. If for any reason you need to access the old copy of the row, you have to fetch it from the separate file. If the transaction that performed that update is rolled back, you need to replace the new version of the row with the old version of the row. This sounds straightforward, but gets really tricky really fast. For example, what happens if I have row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b. If you want to rollback the original transaction, row X will no longer fit in its original spot because row Z is now taking up part of the space it used to occupy.

11 comments

> Zheap does lead to lots of tricky scenarios. If for any reason you need to access the old copy of the row, you have to fetch it from the separate file. If the transaction that performed that update is rolled back, you need to replace the new version of the row with the old version of the row. This sounds straightforward, but gets really tricky really fast.

This is pretty much what Oracle is doing.

I have the same impression. It solves all the autovacuum problems, but has a different set of tradeoffs, e.g. the infamous 'ORA-01555 snapshot too old' when your query tries to read the old version but it has already been cleaned away.
Ah the memories! Yes, the "other file" talked about in these comments is also not an unlimited resource and, depending on implementation specifics, I've found can be more limiting than the current PostgreSQL MVCC approach. In Oracle, I use to see data maintenance related procedures being written that would loop and update some large data set, but intermittently commit progress every so often... this would end up exhausting the undo availability. The more frequent the commits, the faster you'd get to the ORA-01555 before the end of the run. Not saying the procedure was right... just that the pattern was common as was the error.

Also, not too long ago PostgreSQL added Stored Procedure support... which allows for mid-transaction commits. Maybe it will be a case of "what goes around, comes around".

I should note I'm not at all against this Zheap idea, stored procedures, or multiple approaches to MVCC/storage back ends in PostgreSQL... being able to choose my effective pros/cons for a project is really desirable. But, solving the vacuum problem will create new problems.

EnterpriseDB (which started this project) specializes in providing postgresql version that can emulate most of Oracle database functionality, so this is not surprising.
No, it provides a basis to ease your migration from Oracle. There is a lot it doesn't do. Basically, if you have a homegrown app on Oracle it removes a lot of the low hanging fruit to migrate but you aren't running peoplesoft or something on it.
I personally never used it, but worked for a company who at the time used EDB to migrate away from Oracle. They apparently had a tons of stored procedures and they mostly worked, but it is possible they didn't use any of the things you mentioned.

Anyway maybe I should say differently, maybe they don't provide most of the functionality, but their goal is provide database that can help to migrate away from Oracle, and this storage engine that appears to mimic one in Oracle (and I'm guessing has similar tradeoffs) fits that model.

> emulate most of Oracle database functionality

What is the basis for this statement?

* MATCH_RECOGNIZE isn't supported

* reference partitions aren't supported

* TIMESTAMP is incompatible and the default compile options are comically bad

These are the first three things that came to mind and none of them are supported so I stopped looking further.

Its also how MySQL works, using overwrite+undo_log rather than multiple row versions in the same table.
Thanks for this. So I’m fairly familiar how MVCC relates to CoW, and the way that I read this is that it’s kind of the inverse of CoW: it’s an in-place update and a copy of the old data.

I can completely imagine the corner cases this ends up touching, as there must be an insane amount of logic in Postgres that assumes immutability of data on disk: if I read data block X for transaction version 1234, it will always remain the same.

It’s a very interesting approach though. Once all the corner cases are solved and delt with, I wonder: are there any reasons not to choose zheap over the standard storage engine?

> I can completely imagine the corner cases this ends up touching, as there must be an insane amount of logic in Postgres that assumes immutability of data on disk: if I read data block X for transaction version 1234, it will always remain the same.

Most of the code having such low level assumptions has been centralized for PG12, as part of the introduction of pluggable table storage. Also note that that assumption as stated isn't actually true - we don't expect on-disk data to be immutable. There is on-access "pruning" of dead rows, there's VACUUM, there's new row versions on the same page etc.

There are probably some cases where traditional heap would be better that (ideal, finished) zheap. Some ideas: Crash recovery is faster without an UNDO phase. Accessing data that has been modified many times since you took your snapshot may be slower if you have to walk back in time through multiple hops to find the right version of each tuple (compared to the regular heap, which also has to step over lots of dead tuples in that case but can do so sequentially). There may be space management problems when there are too many active transactions interested in a page. There may be microscopic overheads that show up in some workloads that don't benefit from updating in place, perhaps because they are insert/delete-only.

I guess both options would be good to have, and it's interesting that recent SQL Server can do it both ways at the user's option (see "Accelerated Database Recovery", something that sounds conceptually a bit like PostgreSQL vacuum instead of ARIES style rollback where you have to undo changes).

the cow approach likely plays better with flash based storage since rewrites happen less often. so if you've got a lot of otherwise extra storage then maybe it's a good idea. though if you run on top of something like f2fs maybe it'll be better in that respect than pg does on it's own.
IIRC one issue with the current design is that as new rows are written in a different place, all indexes on that table have to be updated as well, leading to write amplification. If you do an update in place, you don't need to modify indices (but now that I think of it, what about concurrently running transactions that might want to read the old values using the old indices...?)

Also during the vacuum there's a lot of writes marking rows as available.

In theory, yes, but PostgreSQL's current storage is not optimized for flash.
> row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b.

I thought there is a fixed amount of space allocated per row and varying length blobs are stored elsewhere, is my understanding wrong?

Yes. Rows are variable length. Think null values, strings, etc.

Only really large values that don't fit into a page are stored elsewhere, and only if they can't be compressed and made to fit. I think it's > 3kb. But I don't remember why that number comes to mind. Pages are 8kb.

2kb by default, configurable per-table with the toast_tuple_target storage parameter if you want more control: https://www.postgresql.org/docs/current/sql-createtable.html
Your write-up (or something similar) should be on their front page. Most projects assume you already know everything about them.

I don't remember the name but there was a NoSQL DB on HN a couple weeks ago. I had to Google and read different websites for a solid 10 minutes before I got that it was indeed a NoSQL DB and what was the differentiating factor from other DBs.

Oof. That sounds super hairy. Don’t you run into all sorts of weirdness with things like bitmaps expecting to find a specific version of a tuple at a specific location in the heap?
Nope. Any time you are looking at a tuple in the existing heap, you have to test it for visibility against your snapshot. With zheap, it's much the same, except that, rather than just "yes or no", the answer could be that you have to following an update chain into the undo log to get an older version.
Ah gotcha. Thanks for the explanation, makes sense. So I guess you still have transaction ids you need to honour on your way back.
This is more or less how MySQL works, hence it's lack of vacuum.
This sounds like temporal tables, does it not - except deep inside the storage engine?? The historical copy goes to a separate table, the current row is in the current table.

I wonder if there is a logical overlap in the ideas between temporal table handling and this low level storage engine optimization.

Indeed, Postgres used to have "time travel" [1] — allowing you to view the table as it was a given point in time. It was removed in 6.2, though some of

Of course, support doesn't require Postgres-style MVCC. Oracle has time travel, which it calls Flashback ("SELECT AS OF"), even though it uses a different type of concurrency control.

[1] https://www.postgresql.org/docs/6.3/c0503.htm

PostgreSQL used to have a limited form of unitemporalism by exploiting its MVCC implementation. You could retrieve unvacuumed rows through some additional magic.

For true bitemporalism you need more than current-row/historical-row separations, though.

> I have row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b. If you want to rollback the original transaction, row X will no longer fit in its original spot because row Z is now taking up part of the space it used to occupy.

But this only can happen if replace X with Y and add Z are in the same transaction. If you rollback the transaction, then you start by removing Z, then replacing Y with X. What I'm missing here?

> But this only can happen if replace X with Y and add Z are in the same transaction. If you rollback the transaction, then you start by removing Z, then replacing Y with X. What I'm missing here?

Z can be in a different transaction than X and Y. If two transactions run concurrently, one that replaces X with Y and a second one that inserts Z, the above scenario can happen.

No it cant. If they run concurrently, Z has no notion of Y happening. This is the whole point of transactions.
Transactions also have no "notion" of storage implementation details, so your point really isn't relevant to the topic at hand. They operate at a higher level of abstraction.
Thanks for the explanation. In layman terms how much is the benefit from separating the old rows from the database vs the cost of accessing it from a separate file?

Just curious why it’s being done now as sounds like a major design decision that would have been considered a long time ago

In Postgres 12, the pluggable storage API was introduced. That allows use of multiple different storage engines with Zheap being one of the new storage engines.

IIRC, one of the big reasons for implementing pluggable storage was for Zheap.

FWIW, the existing implementation has worked really well for most use cases.

It's a complex tradeoff. It depends on the specific workload which scheme is better. There's no clear winner. This is why it's being introduced as a pluggable choice vs a wholesale change.

Andy Pavlo's class has a lot of good general information on the topic: https://15721.courses.cs.cmu.edu/spring2020/slides/03-mvcc1....

To put it simply (very simply), most of the time you don't need old rows (rollbacks are rare, crashes are rare, etc.). If you store old rows (undo) in a separate place, you don't have to read (and skip) them all the time while reading actual data. That's the main benefit.
> replace the old row with the new one and write a copy of the old row to a separate file

Does it do this in the reverse order? If not, how does it prevent dataloss in case the main table update succeeds but writing to the "old copy" table fails?

Well, conceptually it happens atomically. The change is described in a single WAL (write ahead log) record, and data in the buffers representing the table and undo log isn't allowed to touch the disk until the WAL is on disk, so in any crash/restart scenario that leaves the job half done, we'll simply do it again when we replay all changes since the last checkpoint. A related question is: what happens to the uncommitted change after that? The system recognises that the change belongs to an uncommitted transaction, and rolls it back, which in this case means that we'll copy the old copy back into the main table before allowing any user to access it, and free up the space in the undo log. This is approximately how most other RDBMSes work.
Ah of course, how silly of me to forget about the log. Thank you, that makes perfect sense.
It's more complex than the above. The key constraint is actually that the transaction is not allowed to return as committed until the WAL is stable on disk. Whether the writes it did to various pages are allowed before that point or held until after is an implementation choice, and different databases make different choices. It gets complex fast, but the short summary is that the recovery protocols understand how to reconstruct the state after the fact, redo what needs to be redone, and undo what needs to be undone.
> Does it do this in the reverse order?

I would guess so, but I haven't looked up the implementation so I'm not sure. There's a ton of race conditions that can come up depending on the exact order you write things so I'm sure the actual implementation is pretty messy.

Have you stress tested it? How does this end up faring with large scale deployments?
If you look at the most recent status update on the project:

> 12-10-2020: Most regression tests are passing, but write-speeds are still low.

In theory this is much better than what Postgres currently does. Performance depends on autovacuum cleaning up the mess you leave behind, with the amount of dead tuples increasing you also get index bloat which leads to reading much much more data than necessary. I see 3-5x read amplification before vacuum kicks in on update heavy tables (with vacuum tuned a lot).

This is basically what Oracle does with UNDO and it is obviously very scalable. Vacuum on the other hand has its limits, there is only so much you can squeeze out of a single threaded (at table level) process.

What I look most forward to is that it should enable FLASHBACK queries.

I doubt that because in practice PostgreSQL is faster on many workloads than Oracle and InnoDB despite Oracle and InnoDB using similar models to zheap. If the theoretical difference has been large then we should also expect to see big differences in practice, which is not the case.

While zheap may be a superior design I do not think you should expect any major performance improvements other than on some very specific workloads.

Undo logging vs redo logging (except both are MVCC).