by Denzel 1238 days ago
Thanks for the response. I was responding to the comment more so than advocating for adjusting your log retention. :) Looking forward to part 2.

Are you able to reconcile some of the numbers and calculations in the article for me? (Understanding that you don't want to reveal any confidential info.) I see:

- 31 PB data + 10 PB application logs = 41 PB logs (uncompressed json) costs 7-figures (say ~$5M)

- 41 PB logs * 5% ORC compression = ~ 2 PB logs (compressed ORC) costs low 6-figures (say ~$300k)

I don't know what timeframe that cost is measured over. But that brings us to $300k / 2 PB = $0.15 / GB, which is far above S3's quoted costs, so I must be missing something.
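For anyone who wants to check the arithmetic, here is a rough sketch of the same back-of-envelope calculation. The dollar figure is the guess from the bullets above, not a confirmed number, and the variable names are only for illustration. It assumes decimal units (1 PB = 1,000,000 GB).

```python
# Back-of-envelope check of the figures above.
# assumed_cost_usd is a guess ("low 6-figures"), not a confirmed number.
GB_PER_PB = 1_000_000  # decimal units assumed

raw_pb = 31 + 10                 # 41 PB of uncompressed JSON logs
orc_ratio = 0.05                 # ORC output assumed to be ~5% of the raw size
compressed_pb = raw_pb * orc_ratio

assumed_cost_usd = 300_000       # guessed cost of the compressed logs
rounded_pb = 2                   # the estimate rounds ~2.05 PB down to 2 PB
cost_per_gb = assumed_cost_usd / (rounded_pb * GB_PER_PB)

print(f"{compressed_pb:.2f} PB compressed")
print(f"${cost_per_gb:.2f} per GB")
```

Note that the result is dollars per GB with no time unit attached, which is part of why it is hard to compare against a per-GB-month price.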

1 comments

The costs I talk about in the estimate in this post are for the remaining cost of each stored file. We have S3 Inventory dumping metadata of all the files in specific buckets weekly, so I had written a job that calculated the exact remaining cost of each file, accounting for lifecycle events like moving to infrequent access storage in S3 and the eventual deletion of the file. So it’s sort of the “potential energy” version of the cost of our stored files. If we take no action they will aggregate to a certain amount of money.
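To make the "remaining cost" idea concrete, here is a minimal sketch of how such a calculation could look. This is not the actual job: the function name, the lifecycle thresholds, and the per-GB-month rates are all hypothetical placeholders, and it ignores details such as minimum storage durations, request charges, and transition fees.

```python
def remaining_cost(size_gb, age_days, ia_after_days, expire_after_days,
                   standard_rate, ia_rate, days_per_month=30.0):
    """Cost still to be paid for one file if no action is taken.

    Rates are in dollars per GB-month. All parameters are illustrative
    assumptions, not real contract or list prices.
    """
    # Days the file will still spend in standard storage from now on.
    standard_days = max(0, min(ia_after_days, expire_after_days) - age_days)
    # Days it will still spend in infrequent access before it is deleted.
    ia_start = max(age_days, ia_after_days)
    ia_days = max(0, expire_after_days - ia_start)
    gb_months_per_day = size_gb / days_per_month
    return gb_months_per_day * (standard_days * standard_rate + ia_days * ia_rate)


# Hypothetical inventory rows: (size in GB, age in days).
files = [(100.0, 10), (50.0, 400), (20.0, 800)]

# Placeholder lifecycle: infrequent access after 30 days, deletion after ~2 years.
total = sum(
    remaining_cost(size, age, ia_after_days=30, expire_after_days=730,
                   standard_rate=0.02, ia_rate=0.01)
    for size, age in files
)
print(f"remaining cost if no action is taken: ${total:.2f}")
```

A file past its expiry age contributes nothing, and a young file contributes the most, which is the "potential energy" framing: the total is money not yet spent, rather than a monthly bill.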

I reckon you may be looking at the monthly cost of storage per gigabyte, which is why the number doesn’t seem to make sense. Our retention policy started off at about 2 years, so the remaining lifetime per file amortizes out to much more than 1 month.

Also worth considering that we have a custom AWS contract, so none of our actual numbers are the publicly advertised rates and probably won’t entirely math out if you try to ballpark with those numbers.

Thanks for the clarification!