by Demiurge
212 days ago
What if it was 650TB? This article is obviously a microbenchmark. I work with much larger datasets, and neither awk nor DBD would make a difference to the overall architecture. You need a data catalog, and you need clusters of jobs at scale, regardless of the data format library or libraries.
working memory requirements
So for each date in the dataset we need 16 bytes to accumulate the result. That's ~180,000 years' worth of daily post counts per GB of RAM - but the dataset in the post was just 1 year.
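A quick back-of-envelope check of the 16-bytes-per-date figure, as a sketch only. The 16 bytes come from the comment; taking 1 GB as 2^30 bytes and a year as 365.25 days are my assumptions, and the function name is made up for illustration.

```python
# Rough estimate of how many years of daily counts fit in 1 GiB of RAM.
BYTES_PER_DATE = 16      # from the comment: accumulator state per distinct date
GIB = 2**30              # assumption: "GB" read as GiB
DAYS_PER_YEAR = 365.25   # assumption: average year length, one row per day


def years_per_gib(bytes_per_date: int = BYTES_PER_DATE) -> float:
    """Years of one-entry-per-day accumulators that fit in 1 GiB."""
    dates_that_fit = GIB // bytes_per_date
    return dates_that_fit / DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"{years_per_gib():,.0f} years of daily counts per GiB")
```

Either way, the working set for a group-by-date over one year of posts is a few kilobytes, so memory is not the constraint here.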
This problem should be mostly network-limited in the OP's context: decompressing Snappy-compressed Parquet should run at circa 1 GB/s. The "work" of parsing a string to a date and accumulating isn't expensive compared to Snappy decompression.
I don't have a handle on the 33% runtime difference between DuckDB and Polars here.