Hacker News new | ask | show | jobs
by joelwilsson 3143 days ago
BigQuery uses a lot of tricks to get efficiency, but this post emphasizes Apache Arrow and open data formats like it as the way forward (in particular the last point, "Be open, or else…") which are not currently supported by BigQuery.

If Apache Arrow takes off I hope BigQuery will support it as a data interchange format in the future. Zero-copy is pretty awesome, as are open standards in general. This feature does not exist in BigQuery today (as far as I know - definitely not as discussed in the source).

1 comments

One thing people commonly miss about this is this is all meaningless if you don’t have the corresponding runtime integration, and these techniques very much imply co-design, and therefore tight coupling, between the format and the runtime. To give a concrete example, to make any of this efficient and fast you must have predicate pushdown directly into decoder such that filters could skip the data they don’t need to decode. Some aggregations could be handled the same way.

So it’s a little incorrect to think of this as a “file format” in the first place. If you end up designing it like that, you’d not be able to have a lot of the gains that the Abadi paper (and others like it) alludes to. My suggestion would be to go whole hog and push down as much filtering and aggregation in there as is feasible, exposing a higher level interface with _at least_ filtering predicate support, and do it in C++.