Hacker News new | ask | show | jobs
by ignoramous 1589 days ago
> It's the users fault for misusing the service.

I believe, AWS' usage-based billing make for long-tail surprises because its users are designing systems exactly as one would expect them to. For example, S3 is never meant for a bazillion small objects which Kinesis Firehose makes it easy to deliver to it. In such cases, dismal retrieval performance aside [0], the cost to list/delete dominate abnormally.

We spin up a AWS Batch job every day to coalesce all S3 files ingested that day into large zlib'd parquets (kind of reverse VACCUM as in postgres / MERGE as in elasticsearch). This setup is painful. I guess the lesson here is, one needs to architect for both billing and scale, right from the get go.

[0] https://news.ycombinator.com/item?id=19475726

1 comments

Perhaps I don't fully understand the nuances of what you're trying to do, but...

> S3 is never meant for a bazillion small objects which Kinesis Firehose makes it easy to deliver to it

Are you saying Firehose increases the likelihood of creating the "small file problem"?

If so, isn't this exactly what Firehose tries to prevent? Sure, you can set all the thresholds low and unnecessarily generate lots of small files, but you can tune those thresholds to maximize record/file size and attain a reasonable latency. If there's a daily batch job to make this data useful, then who cares about latency?

Also, why would you run a daily batch job to coalesce all these files into parquet files instead of letting Firehose just do that for you. It can also do a certain amount of partitioning if it's required.

> Are you saying Firehose increases the likelihood of creating the "small file problem"?

Firehose makes it easy to do so (when the thresholds are too low, as you point out). That is, it'd happily chug along and do what you ask of it to. Sometimes, these problems only manifest in the long run (kind of like a frog in boiling water).

> Also, why would you run a daily batch job to coalesce all these files into parquet files instead of letting Firehose just do that for you.

Firehose recommends that the output be at least 64M to 128M for parquet files... we don't have anywhere near that much amount of data to yeet out of Firehose, especially because data is partitioned per-user (and a single user doesn't generate anywhere near that much data, and so we're left with the current setup). And so: It was either to let Firehose batch the data up in larger parquets (and run the partitioning job offline), or employ its partitioning magic online (and run the merge job offline, on-demand). We chose the latter for cost efficiency given our workloads.