by hendiatris
387 days ago
In the lower-level Arrow/Parquet libraries you can control the row groups, and even the data pages (although that's a lot more work). I have used this heavily with the arrow-rs crate to drastically improve (roughly 10x) how quickly data could be queried from files. Some row groups will have just a few rows, others will have thousands, but being able to skip most row groups entirely at query time makes the skew irrelevant. Just beware that one issue you can run into is the limit on row groups per file (2^15).
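To make the idea concrete, here is a rough toy model in plain Python. It is not the arrow-rs or pyarrow API: `RowGroup`, `build_row_groups`, and `lookup` are hypothetical names, and the "statistics" are just a min/max key kept next to each group. It assumes the data is sorted by the lookup key and cut into one row group per distinct key, which is only one possible layout.

```python
from dataclasses import dataclass
from itertools import groupby

# Per-file limit as quoted in the comment above (treated here as an assumption).
MAX_ROW_GROUPS = 2**15


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group: rows plus min/max key statistics."""
    min_key: str
    max_key: str
    rows: list


def build_row_groups(records):
    """Sort by key and cut one row group per distinct key, so sizes follow the skew."""
    ordered = sorted(records, key=lambda r: r[0])
    groups = []
    for key, chunk in groupby(ordered, key=lambda r: r[0]):
        groups.append(RowGroup(min_key=key, max_key=key, rows=list(chunk)))
    if len(groups) > MAX_ROW_GROUPS:
        raise ValueError("too many row groups for one file")
    return groups


def lookup(groups, key):
    """Return matching rows and the number of row groups that had to be scanned."""
    scanned = 0
    hits = []
    for g in groups:
        if key < g.min_key or key > g.max_key:
            continue  # statistics rule this group out, so its rows are never read
        scanned += 1
        hits.extend(r for r in g.rows if r[0] == key)
    return hits, scanned


# Heavily skewed input: one key with thousands of rows, two keys with a handful.
records = [("a", i) for i in range(5000)] + [("b", 1), ("b", 2), ("c", 7)]
groups = build_row_groups(records)
hits, scanned = lookup(groups, "b")
```

The group sizes end up wildly uneven, but a lookup for a rare key only touches the one small group whose min/max range contains it, which is why the skew stops mattering.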