
> Hilbert-curve based clustering which solves a lot of the downsides of hive partitioning

Yes, that solved the 2-column high-NDV (number of distinct values) partitioning issue. If your IP traffic was sorted on only destination or only source and you need both, you need something like Z-curves, which do the same thing and are a little easier to compute with bit twiddling for fixed-width types.
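To make the bit twiddling concrete, here is a rough sketch of a Z-order (Morton) key for two 32-bit values such as IPv4 source and destination addresses. The function names `spread_bits_32` and `z_order_key` are made up for illustration, and this is not taken from any particular engine; it is just the usual mask-and-shift interleave.

```python
def spread_bits_32(x: int) -> int:
    """Move the 32 bits of x to the even bit positions of a 64-bit integer."""
    x &= 0xFFFFFFFF
    x = (x | (x << 16)) & 0x0000FFFF0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0F
    x = (x | (x << 2)) & 0x3333333333333333
    x = (x | (x << 1)) & 0x5555555555555555
    return x


def z_order_key(src_ip: int, dst_ip: int) -> int:
    """Interleave two 32-bit integers into one 64-bit sort key.

    Bits of src_ip land on odd positions and bits of dst_ip on even ones,
    so sorting by the key keeps rows that are close in both columns
    reasonably close together.
    """
    return (spread_bits_32(src_ip) << 1) | spread_bits_32(dst_ip)
```

Sorting rows by `z_order_key(src, dst)` instead of by one column is what lets min/max file statistics prune on either column. With Python's standard library, `int(ipaddress.IPv4Address("10.0.0.1"))` gives the integer form of an address.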

Hive would write a large number of small files when partitioned like that, or you lose efficiency when scanning on the non-partitioned column.

This does fix the high NDV issue, but in general Netflix wrote hidden partitioning in specifically to avoid sorting on high NDV columns and to reduce the sort complexity on writes (most daily writes won't need any partitioned inserts at all).
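A rough sketch of why a daily write usually skips the sort: with a day-granularity partition derived from the timestamp, a batch that falls within one day maps to a single partition value. The helper names `day_partition` and `spans_multiple_partitions` are hypothetical, and the days-since-epoch encoding is only meant to be analogous to a day transform, not a copy of any engine's code.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_partition(ts: datetime) -> int:
    """Days since the Unix epoch for a timezone-aware timestamp."""
    return (ts - EPOCH).days


def spans_multiple_partitions(batch: list[datetime]) -> bool:
    """True if the batch touches more than one day partition.

    A batch that stays within one day can be written straight into that
    partition without sorting or shuffling rows by partition value.
    """
    return len({day_partition(ts) for ts in batch}) > 1
```

A typical daily load returns `False` here, so the writer has nothing to reorder; only late or backfilled data crossing a day boundary needs a partitioned insert.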

Clustering on timestamp, by contrast, will force a sort even if the write covers a single day.