Hacker News
by logicfiction 1235 days ago
(Author here.) Yes, I believe you are correct with regard to tracking application utilization of, say, EC2 and other AWS resources.

The post fails to mention that this system also tracks internal data dimensions like customer IDs, so we can also use this sampled data to estimate the cost of customers (and join that with tiers of customers, and so forth).
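To make the customer-tier rollup concrete, here is a minimal sketch. It assumes each sampled call carries a customer ID and that every sample represents an equal slice of cost; the function name, the tier lookup, and the numbers are all hypothetical and not taken from the system described in the post.

```python
from collections import defaultdict


def cost_by_tier(sampled_customer_ids, customer_tier, total_cost):
    """Spread total_cost evenly over samples, then roll it up by customer tier.

    sampled_customer_ids: one customer ID per sampled call (hypothetical input).
    customer_tier: mapping of customer ID to tier name (hypothetical lookup).
    """
    if not sampled_customer_ids:
        return {}
    cost_per_sample = total_cost / len(sampled_customer_ids)
    per_tier = defaultdict(float)
    for customer_id in sampled_customer_ids:
        # Customers missing from the lookup land in an explicit "unknown" bucket
        # instead of silently disappearing from the totals.
        tier = customer_tier.get(customer_id, "unknown")
        per_tier[tier] += cost_per_sample
    return dict(per_tier)


samples = ["c1", "c1", "c2", "c3"]
tiers = {"c1": "enterprise", "c2": "free"}
result = cost_by_tier(samples, tiers, 400.0)
```

Keeping an "unknown" bucket is a design choice for the sketch: the per-tier figures still sum to the real total, and gaps in the lookup stay visible.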

I'm also not sure that would allow us to attribute the cost of our datastore utilization, since those datastores are not AWS-hosted versions but ones we run ourselves. The traffic interception lets us say that Application A is using 75% of database cluster XYZ, and therefore that application/product group is most likely responsible for that share of the database's cost.
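A minimal sketch of that proportional split, assuming the samples are simple (application, cluster) pairs and that cost is divided by sampled call count. The names and figures are made up for illustration; this is not the actual attribution library.

```python
from collections import Counter


def attribute_cost(sampled_calls, cluster_cost):
    """Split cluster_cost across applications in proportion to sampled call counts."""
    counts = Counter(app for app, _cluster in sampled_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {app: cluster_cost * n / total for app, n in counts.items()}


# Hypothetical sample: app_a made 3 of the 4 sampled calls to cluster "xyz".
samples = [("app_a", "xyz")] * 3 + [("app_b", "xyz")]
shares = attribute_cost(samples, 1000.0)
```

With these inputs app_a is assigned 75% of the cluster's cost, matching the example in the comment above.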

The last thing I'll mention is that CloudTrail has the potential to be expensive on its own; I believe it costs at least more than storing the raw data in S3 for something like Athena to read. I don't think I'll be writing about it, but we've also done work this last year to trim down what we track in CloudTrail due to the cost of events (for example, tracking everything in S3 ends up being pretty expensive).


When it comes to shared resources like a database cluster, you’re making the assumption that usage is correlated with number of connections.

Is this always true? Typically the shared resources you care about are CPU, memory and disk. I would say an application issuing fewer but much heavier queries is using the shared resource more than an application that issues a larger number of really simple queries. And this doesn’t correlate much with disk usage, right?

There isn’t really a good solution to this. You can use a combination of query sampling and per-app databases to correlate this better.

Great post though, this is something we’ve been dealing with and experimenting with recently.

Your observations are correct. I wouldn’t portray it as an ideal system, just best effort. In the end we care more about the finer details being good funnels to follow up on rather than being exact. We know our real costs of resources, which is important for finance and budget. And then we have the approximate attributions from the sampling which narrow things down enough to focus diagnosis when needed.

I would have to read more into how it intercepts some of our database calls to confirm whether it tries to weight by execution time where it wires into database client code. That would probably be useful and could help, to a degree, to approximate utilization.
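To illustrate the difference such weighting could make, here is a small sketch comparing shares by call count against shares by execution time. It assumes each sample records an application name and a query duration in milliseconds; those fields, the function name, and the numbers are hypothetical.

```python
def usage_shares(samples, weight_by_time=False):
    """Return each application's share of usage, by call count or by total duration."""
    totals = {}
    for app, duration_ms in samples:
        weight = duration_ms if weight_by_time else 1
        totals[app] = totals.get(app, 0) + weight
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {app: w / grand_total for app, w in totals.items()}


# Hypothetical sample: one heavy reporting query versus nine cheap API queries.
samples = [("reports", 900.0)] + [("api", 10.0)] * 9
by_count = usage_shares(samples)
by_time = usage_shares(samples, weight_by_time=True)
```

By count, the reporting job looks like a minor user at 10% of calls; weighted by time it accounts for roughly 91% of the load. Execution time is still only a proxy and says little about disk usage.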

I think in practice it’s a bit uncommon for the heaviest user to also be a sparse user in terms of volume. But I can also admit there are quirks to how it samples. I once personally spent a couple of days tracking down a surprising cost of an application I owned, only to later confirm it was a data flaw in how we were doing this sampled attribution. In this case the heaviest users were un-instrumented infra processes that can’t just wire in our Java cost attribution library, which made it artificially look like my app was the heaviest user.