by dijksterhuis
653 days ago
So with traditional Parquet this is usually handled through “sane” partitioning. Heavily simplified version: each partition is a separate file containing a bunch of table rows, and the partition splits are determined by the values in those rows. If you’ve got data with a date column (sign-up date, order date or something), you would partition on a YYYY-MM field you create early on. Then each time you run a query filtering by YYYY-MM, your OLAP query tool no longer needs to read a whole bunch of files from disk or S3. If you only want to look at 2023-12, you only need to read one file to run the query. Edit: OLAP kinda stuff is all about getting the data “slices” nicely organised for queries people will run later.
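A rough sketch of that layout idea, in case it helps. To keep it stdlib-only this writes CSV files instead of real Parquet (a swap made purely for illustration), and the `year_month=...` directory naming, the `order_date` column, the sample rows and both helper functions are made-up assumptions, not anything from a specific tool. The point is only that a query for one month opens that month's file and nothing else.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, root, date_field="order_date"):
    """Write rows into one file per YYYY-MM partition (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        # Partition key = first 7 chars of an ISO date: "2023-12-03" -> "2023-12".
        groups[row[date_field][:7]].append(row)
    for year_month, part_rows in groups.items():
        part_dir = Path(root) / f"year_month={year_month}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # CSV stands in for Parquet here; the directory layout is the point.
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(part_rows[0].keys()))
            writer.writeheader()
            writer.writerows(part_rows)


def read_partition(root, year_month):
    """Read only the files of one partition; return (rows, files_opened)."""
    part_dir = Path(root) / f"year_month={year_month}"
    rows, files_opened = [], 0
    for path in sorted(part_dir.glob("*.csv")):
        files_opened += 1
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows, files_opened


# Made-up sample data spanning three months.
orders = [
    {"order_id": "1", "order_date": "2023-11-28", "amount": "10.00"},
    {"order_id": "2", "order_date": "2023-12-03", "amount": "25.50"},
    {"order_id": "3", "order_date": "2023-12-19", "amount": "7.25"},
    {"order_id": "4", "order_date": "2024-01-02", "amount": "12.00"},
]

root = tempfile.mkdtemp()
write_partitioned(orders, root)

# "Query" filtering on 2023-12: only that partition's file gets opened.
december, files_opened = read_partition(root, "2023-12")
print(len(december), "rows from", files_opened, "file(s)")
```

Real Parquet tooling does the same kind of pruning from the partition values in the path, and in practice a partition is usually a directory holding one or more files rather than exactly one file.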