| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ivoras 1964 days ago
	Is the partial index technique to avoid indexed NULL data as effective for PostgreSQL 13+? It looks like in v13+ PostgreSQL could create a single leaf for NULL data and just store row pointers within it, which should reduce data sizes at least a bit.

2 comments

mattashii 1964 days ago

Not per se _as effective_, but it will still help a lot. NULL tuples pre-pg13 take ~ 14 bytes each, and 18 bytes when aligned. (= 2 (ItemID, location on page) + 6 (TID) + 2 (t_info) + 4 (NULL bitmap) + 4 bytes alignment). When deduplication is enabled for your index, then your expected tuple size becomes just a bit more than 6 bytes (~ 50 TIDs* in one tuple => 2 (ItemId) + 6 (alt tid) + 2 (t_info) + 4 (null bitmap) + 50 * 6 (heap TIDs) / 50 => ~ 6.28 bytes/tuple).

So, deduplication saves some 65% in index size for NULL-only index-tuples, and the further 35% can be saved by using a partial index (so, in this case, deduplication could have saved 13GB).

*note: last time I checked, REINDEX with deduplication enabled packs 50 duplicates in one compressed index tuple. This varies for naturally grown indexes, and changes with column types and update access patterns.

link

mattashii 1960 days ago

heh, my calculation was incorrect: ItemID is 4 bytes in size, so the calculations are slightly off:

pre-13 was 16 bytes each (20 when 64-bit compiled), and post-13 it is 6.32 bytes/heap tuple when deduplication has kicked in.

link

chrismeller 1964 days ago

He actually mentioned index de-duplication earlier: https://hakibenita.com/postgresql-unused-index-size#activati...

If I had to guess, I would say that it doesn't accomplish anything (or as much as you'd think) for null values simply because there is no real data to store in either approach, you just have a bunch of pointers either way.

link

petergeoghegan 1964 days ago

NULL values are not special as far as deduplication is concerned. They use approximately as much disk space as a non-NULL integer column without deduplication, and compress just as well with deduplication. Deduplication is effective because it eliminates per-tuple overhead, so you see most of the benefits even with index tuples that naturally happen to have physically small keys. You'll still get up to a 3x decrease in storage overhead for the index provided there is low cardinality data (and not necessarily that low cardinality, ~10 or so tuples per distinct value will get you there).

The NULL issue is documented directly -- see the "Note" box here:

https://www.postgresql.org/docs/devel/btree-implementation.h...

link