Hacker News new | ask | show | jobs
by appplication 49 days ago
That’s a really cool idea I had not heard before, thank you for sharing this. It also feels like the type of thing a db ought to be able to do under the hood. I wonder why this is not a config (though there’s a pg extension for everything so maybe it does exist)
2 comments

As far as I can tell, Postgres is not designed with this inclination towards doing lighter work when clients are waiting and piling up maintenance work to do in background. I think the background work it does is mostly running vacuum on tables now and then.

Contrast that with ClickHouse, for example. It operates in a different niche than Postgres (OLAP instead of OLTP) – with their merge tree engine family [1] that does data deduplication in background.

There is one project of modernizing Postgres' storage engine called OrioleDB [2], but I think the company got acquihired by Supabase [3] and maybe the project has not been progressing very quickly since then.

[1] https://clickhouse.com/docs/engines/table-engines/mergetree-... [2] https://www.orioledb.com/ [3] https://supabase.com/blog/supabase-acquires-oriole

Oriole is under very heavy development since we acquired them. We cut a release just 2 weeks ago:

https://github.com/orioledb/orioledb/releases/tag/beta15

We expect it to be production ready this year

It wouldn't be atomic, and so would break transaction semantics.

If you committed a row update but didn't update the index, a subsequent query using the not yet updated index would not find the updated row correctly.

It would also only work for certain types of indexes, you couldn't do it for uniqueness constraint for example.

I do agree that in theory you could have some extension to the index declaration that covers all that, but my worry there would be that it would be non obvious and a foot gun. Doing it the way described above makes that break in semantics clear.

> If you committed a row update but didn't update the index, a subsequent query using the not yet updated index would not find the updated row correctly.

I wonder if you could make it so that queries read from both the index and the unindexed changes. It would be slightly slower but as long as the unindexed changes are kept small it might be fine.

My impression is that InnoDB (MySQL's primary storage engine) is doing something like this. We have never seen any slow-downs on adding to the data set I've discussed in this thread, even at hundreds of millions of rows, and per the nature of the system creating this data the majority of these rows are targeted for additional single-row DML within a few seconds of being inserted, with instantaneous effect.