| HN Mirror

One thing people commonly miss about this is this is all meaningless if you don’t have the corresponding runtime integration, and these techniques very much imply co-design, and therefore tight coupling, between the format and the runtime. To give a concrete example, to make any of this efficient and fast you must have predicate pushdown directly into decoder such that filters could skip the data they don’t need to decode. Some aggregations could be handled the same way.

So it’s a little incorrect to think of this as a “file format” in the first place. If you end up designing it like that, you’d not be able to have a lot of the gains that the Abadi paper (and others like it) alludes to. My suggestion would be to go whole hog and push down as much filtering and aggregation in there as is feasible, exposing a higher level interface with _at least_ filtering predicate support, and do it in C++.