by arjunnarayan
2709 days ago
(author of the blog post here) I'd second ryanworl's comment that the rabbit hole goes much deeper than just storing things in a column-oriented on-disk or in-memory format like Parquet or Arrow. That's just the first step. To get the best performance, you have to have your data in an in-memory format that allows you to compress it efficiently, and then perform many relational operations on the compressed form itself. Another example is run-length and delta encoding a sorted column of integers, and then building relational operators (e.g. a join) that operate directly on the compressed data. The best explanation of the various techniques that go into the data structures and operator designs for OLAP workloads is the survey 'The Design and Implementation of Modern Column-Oriented Database Systems' by Abadi, Boncz, Harizopoulos, Idreos, and Madden: http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
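To make the "operate on the compressed form" idea concrete, here is a rough Python sketch of a merge join over two run-length-encoded sorted integer columns. It is only an illustration under simplifying assumptions, not how any particular engine does it: the names `rle_encode` and `rle_merge_join` are made up for this example, delta encoding is left out, and the join emits per-key match counts rather than materialized row pairs.

```python
from itertools import groupby


def rle_encode(sorted_values):
    """Compress a sorted column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(sorted_values)]


def rle_merge_join(left_runs, right_runs):
    """Equi-join two RLE-compressed sorted columns without decompressing.

    Each output pair is (value, number_of_matching_row_pairs). A run of
    length m on the left joined with a run of length n on the right
    produces m * n output rows, so one multiplication replaces m * n
    row-by-row comparisons.
    """
    joined = []
    i = j = 0
    while i < len(left_runs) and j < len(right_runs):
        left_value, left_len = left_runs[i]
        right_value, right_len = right_runs[j]
        if left_value < right_value:
            i += 1  # no match possible for this left run
        elif left_value > right_value:
            j += 1  # no match possible for this right run
        else:
            joined.append((left_value, left_len * right_len))
            i += 1
            j += 1
    return joined


left = rle_encode([1, 1, 1, 2, 5, 5])   # [(1, 3), (2, 1), (5, 2)]
right = rle_encode([1, 1, 5, 5, 5, 7])  # [(1, 2), (5, 3), (7, 1)]
joined = rle_merge_join(left, right)    # [(1, 6), (5, 6)]
```

The work done by the join is proportional to the number of runs, not the number of rows, which is where the win comes from on columns with long runs.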