by ignoramous
1586 days ago
> Are you saying Firehose increases the likelihood of creating the "small file problem"?

Firehose makes it easy to do so (when the thresholds are too low, as you point out). That is, it'd happily chug along and do what you ask of it. Sometimes, these problems only manifest in the long run (kind of like a frog in boiling water).

> Also, why would you run a daily batch job to coalesce all these files into parquet files instead of letting Firehose just do that for you.

Firehose recommends that the output be at least 64M to 128M for parquet files... we don't have anywhere near that much data to yeet out of Firehose, especially because data is partitioned per-user (a single user doesn't generate anywhere near that much data), and so we're left with the current setup.

And so: it was either let Firehose batch the data up into larger parquets (and run the partitioning job offline), or employ its partitioning magic online (and run the merge job offline, on-demand). We chose the latter for cost efficiency given our workloads.
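
To make the "merge job offline" half concrete, here is a rough sketch of how the planning step of such a job might look. It is not the actual job: the `user=<id>/...` key layout, the names `partition_of` and `plan_merges`, and the 128 MiB target are assumptions for illustration. It only decides which small files to coalesce per partition and leaves reading and writing parquet out.

```python
from collections import defaultdict

# Assumed target: upper end of the 64M-128M range mentioned above.
TARGET_BYTES = 128 * 1024 * 1024


def partition_of(key):
    """Hypothetical layout 'user=<id>/<file>': the prefix is the partition."""
    return key.split("/", 1)[0]


def plan_merges(objects, target_bytes=TARGET_BYTES):
    """Group small objects per partition into batches of at most target_bytes.

    objects: iterable of (key, size_in_bytes) pairs.
    Returns {partition: [[key, ...], ...]}; each inner list is one merge.
    """
    by_partition = defaultdict(list)
    for key, size in objects:
        by_partition[partition_of(key)].append((key, size))

    plan = {}
    for partition, items in by_partition.items():
        batches, current, current_size = [], [], 0
        for key, size in sorted(items):
            # Close the current batch once adding this file would exceed the target.
            if current and current_size + size > target_bytes:
                batches.append(current)
                current, current_size = [], 0
            current.append(key)
            current_size += size
        if current:
            batches.append(current)
        # A batch with a single file has nothing to merge, so drop it.
        plan[partition] = [b for b in batches if len(b) > 1]
    return plan
```

Running this on-demand, only for partitions whose plan is non-empty, would fit the "merge offline, on-demand" choice: partitions with a single small file are left alone.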