by Denzel 1238 days ago
Yes, shipping computations instead of data is a reasonable design goal. Your proposed system only works when the predicate is independent across all logs though, correct? If you have to correlate or join your logs to anything, then this model becomes more complex. Not to mention, you're adding an additional performance tax to your prod machines which could be more costly than shipping logs to a centralized store. (A team should profile and make a tradeoff decision appropriate to their context.)

Additionally, what happens when we want to correlate these logs with tens of other systems?

I guess I don't agree that distributed log analysis simplifies the problem any more than centralized log analysis does. If the primary concern is cost, then you can save equivalent amounts of money with a different lifecycle policy for centralized logs.

EDIT: Btw, don't get me wrong, you are asking the right questions, the ones HubSpot's performance team should be asking. The first phase of a cost savings program should weigh benefits against cost, or stated another way, requirements vs cost. You're asking the right question, i.e., how do we actually use this data after we log it? I find it striking that this cost analysis didn't say anything about the end-user's use cases or benefits. Sure, we can optimize a system and save 40% of the cost, but what if no one is using the system? Then we could save 100% of the cost.


I once worked on a system where we were told to keep 18 months of debug logs (something that would have cost about $2k/month). When we pushed back and asked why, the answer was that occasionally (every month or every other month) there would be some customer issue that would need investigation and might result in a customer refund of $20-50.

Setting aside that the human time required for the investigation was probably close to $40-50, it was still not a slam dunk to get the business to shrink retention to a few days for critical debug.
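
To put rough numbers on that, here is a back-of-envelope sketch. It uses only the figures quoted above and takes the worst case for refunds (one $50 refund every single month); the variable names are just for illustration.

```python
# Figures from the anecdote above; worst-case assumptions for the refund side.
storage_per_month = 2_000        # dollars per month to keep 18 months of debug logs
refund_high = 50                 # largest refund mentioned, in dollars
incidents_per_month_high = 1.0   # "every month or every other month", upper bound

# Most the business could be paying out in refunds per month.
max_refunds_per_month = refund_high * incidents_per_month_high

# How many times larger the storage bill is than the refunds it relates to.
ratio = storage_per_month / max_refunds_per_month
print(f"Storage costs about {ratio:.0f}x the worst-case monthly refunds")
```

This ignores the investigation time and any value the logs have beyond refund disputes, so it is only a sanity check on the order of magnitude.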

Like markets... executives can stay irrational longer than you can remain sane, sometimes.
Anything that could lead to a customer revenue dispute is a critical audit log and needs to go to gold-plated log storage. But you will also be paying attention to optimizing costs for that, and the volumes will be relatively small compared to application informational logs.
Seems like the kind of situation where you shrug, agree, compress and ship those logs off to cold storage to meet the requirement for a fraction of the price.
That's a fair criticism in the edit. Part 2 will cover that a bit more. I did run analysis on the types of queries users ran against the data and what parts of the timeseries were used, which informed a bit of our solution. I don't want to give away too much, but lifecycle retention adjustment ends up being relatively lower value (but still worthwhile) compared to general space savings.
Thanks for the response. I was responding to the comment more so than advocating for adjusting your log retention. :) Looking forward to part 2.

Are you able to reconcile some of the numbers and calculations in the article for me? (Understanding that you don't want to reveal any confidential info.) I see:

- 31 PB data + 10 PB application logs = 41 PB logs (uncompressed json) costs 7-figures (say ~$5M)

- 41 PB logs * 5% ORC compression = ~ 2 PB logs (compressed ORC) costs low 6-figures (say ~$300k)

I don't know what timeframe that cost is measured over. But that brings us to $300k / 2 PB = $0.15 / GB which is far above S3's quoted costs so I must be missing something.
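
Spelled out as a quick sketch, the arithmetic is below. The ~$300k figure is my own guess at "low 6-figures", and I am assuming decimal units (1 PB = 1,000,000 GB) and rounding 2.05 PB down to 2 PB as above.

```python
GB_PER_PB = 1_000_000  # decimal units; an assumption, not from the article

raw_pb = 31 + 10                 # data + application logs, uncompressed JSON
compressed_pb = raw_pb * 0.05    # ORC at roughly 5% of the original size

guessed_cost = 300_000           # my guess at "low 6-figures", in dollars
rounded_pb = 2                   # 2.05 PB rounded down, as in the bullet above

cost_per_gb = guessed_cost / (rounded_pb * GB_PER_PB)
print(f"{compressed_pb:.2f} PB compressed, ${cost_per_gb:.2f} per GB")
```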

The costs I talk about in the estimate in this post are for the remaining cost of each stored file. We have S3 Inventory dumping metadata of all the files in specific buckets weekly, so I had written a job that calculated the exact remaining cost of each file, accounting for lifecycle events like moving to infrequent access storage in S3 and the eventual deletion of the file. So it’s sort of the “potential energy” version of the cost of our stored files. If we take no action they will aggregate to a certain amount of money.

I reckon you may be looking at the monthly cost of storage per gigabyte which is why the number doesn’t seem to make sense. Our retention policy started off at about 2 years, so the remaining lifetime per file amortizes out to much more than 1 month.

Also worth considering that we have a custom AWS contract, so none of our actual numbers are the publicly advertised rates and probably won’t entirely math out if you try to ballpark with those numbers.
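
To make the "remaining cost" idea concrete, here is a minimal sketch of the per-file calculation. It is not the actual job: the function name, the two-tier lifecycle (standard, then infrequent access, then deletion), the 30-day month, and the rates in the example are all placeholder assumptions.

```python
def remaining_cost(size_gb, age_days, ia_after_days, delete_after_days,
                   std_rate, ia_rate):
    """Estimate what one file will still cost from today until it is deleted.

    Rates are dollars per GB-month; a month is approximated as 30 days.
    Assumes ia_after_days <= delete_after_days.
    """
    if age_days >= delete_after_days:
        return 0.0  # already past its deletion date
    # Days the file will still spend in standard storage.
    std_days = max(0, ia_after_days - age_days)
    # Days it will then spend in infrequent access before deletion.
    ia_days = delete_after_days - max(age_days, ia_after_days)
    return size_gb * (std_days * std_rate + ia_days * ia_rate) / 30


# Hypothetical file: 100 GB, 10 days old, moves to IA at 30 days,
# deleted at 2 years. The rates are made up, not real prices.
total = remaining_cost(100, 10, 30, 730, std_rate=0.02, ia_rate=0.01)
one_month = 100 * 0.02
print(f"remaining: ${total:.2f}, one month at standard rate: ${one_month:.2f}")
```

Summing this over every row of the inventory gives the aggregate figure, and it shows why dividing that figure by the stored gigabytes lands well above a monthly per-GB price.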

Thanks for the clarification!