by amluto
378 days ago
> Frankly, I'm not entirely sure what the process of proposing that change to the hive file scheme would even look like

Maybe convince DuckDB and/or clickhouse-local and/or polars.scan_parquet to implement it as a pilot? If it's a success, other tools might follow suit. Or maybe something like DuckLake could have an option to put column statistics in the filenames. I raised this as a discussion: https://github.com/duckdb/ducklake/discussions/92
IMO range is probably the most useful statistic to put in a folder/file name anyway, for partitioning purposes. My vote would be for `^` as the range separator, to minimize the risk of collision and confusion, e.g. `timestamp=2025-03-27T00:00:00-0800^2025-03-30-0700` or `hour=0^12`, `hour=12^24`. `^` is valid in file names across all systems, and I'd be very surprised if it were commonly used in a property/column name. The only collision I can think of is that it's the start-of-line anchor in regex.
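
To make the idea concrete, here is a rough sketch of how a reader could parse such a segment and use it to skip partitions. This is not something any existing tool implements as far as I know; `parse_segment`, `may_contain`, and the `Partition` type are made-up names, and the half-open `[low, high)` interpretation is my assumption based on the `hour=0^12`, `hour=12^24` example.

```python
from typing import NamedTuple, Optional

RANGE_SEP = "^"  # proposed separator; hypothetical, not part of the hive scheme today


class Partition(NamedTuple):
    column: str
    low: str
    high: Optional[str]  # None for a plain `key=value` segment


def parse_segment(segment: str) -> Partition:
    """Parse one path segment such as `hour=0^12` or `year=2025`."""
    column, eq, value = segment.partition("=")
    if not eq:
        raise ValueError(f"not a key=value segment: {segment!r}")
    low, sep, high = value.partition(RANGE_SEP)
    return Partition(column, low, high if sep else None)


def may_contain(part: Partition, value: int) -> bool:
    """Partition pruning for integer columns, assuming a half-open [low, high) range."""
    if part.high is None:
        return int(part.low) == value
    return int(part.low) <= value < int(part.high)


# A query for hour 12 only needs to open the second folder.
folders = ["hour=0^12", "hour=12^24"]
print([f for f in folders if may_contain(parse_segment(f), 12)])
```

Timestamps would need a typed comparison instead of `int`, but the splitting logic stays the same since `^` does not appear in ISO 8601 values.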