Hacker News
by perryizgr8 51 days ago
I think I read somewhere that calculating and limiting cloud usage costs is a really hard problem. But I feel that if Google were motivated to do it, they could do it. It's hard, not impossible. They just don't care to solve this particular problem.
5 comments

If they can COUNT it and charge based on that, that means they can count it and react.

If I, not having their budget or engineers, can have Prometheus alerts reacting to metrics pretty much instantly, surely it wouldn't be too hard for them to have triggers like this. Somehow their AI can automatically ban people based on something; can't they do something for the customers?

They can, just don't want to.

In the article it states that this person had an account that would have been limited to $2000 in usage.

And the system automatically upgraded them to higher spending limits when they crossed the $1000 in usage costs.

They could definitely make that an opt-in feature.

Yea, makes no sense for it to be opt out. Otherwise it just means there are no limits.
It's the same fundamental problem as view counters, something Google is famously good at solving. Eventually consistent solutions are well-understood, and wouldn't have these kinds of massive cost-overruns.
It's more a problem they are incentivized to have. OpenRouter allows fixed wallets and doesn't run into the same problem, since it would be their money on the line if they let a user overspend their limits.
Depends on latency. 24 hour delays on an eventually consistent counter used for billing absolutely would cause this problem.
It seems hard to believe that a one-hour delay on such a counter is impossible to achieve, and one hour would reduce the risk from "catastrophic" to "serious problem" in most cases.
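To put rough numbers on that: a back-of-the-envelope sketch, assuming a runaway workload burns money at a constant rate and the cap can only react once the delayed counter catches up. The $500/hour burn rate is a made-up figure for illustration, not something from the article.

```python
def worst_case_overshoot(burn_rate_per_hour: float, counter_delay_hours: float) -> float:
    """Upper bound on spend past the cap before a delayed counter can react.

    Assumes a constant burn rate (a simplification) and that enforcement
    kicks in as soon as the counter reflects the real usage.
    """
    return burn_rate_per_hour * counter_delay_hours


# Hypothetical runaway workload burning $500/hour.
for delay_hours in (24, 1):
    overshoot = worst_case_overshoot(500, delay_hours)
    print(f"{delay_hours:>2}h delay: up to ${overshoot:,.0f} past the cap")
```

Under those assumptions the overshoot scales linearly with the counter delay, so going from 24 hours to 1 hour cuts the worst case by a factor of 24.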

Also, if a cap is a desired feature that justifies making trade-offs, then it is possible to translate the budget cap (in terms of money) back into service-specific caps that are easier to keep consistent. For example, "autoscale this set of VMs" with "my budget cap is $1000/hour" and a VM type priced at $10/hour translates to "autoscale to at most 100 instances". That would need dev work (i.e. this feature being considered important) and would not respect the budget cap in a cross-service way automatically, but still it is another piece in the puzzle.
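A minimal sketch of that translation, assuming a flat per-instance hourly price and a single instance type. The function name `max_instances` is hypothetical, not any cloud provider's API.

```python
import math


def max_instances(budget_per_hour: float, price_per_instance_hour: float) -> int:
    """Translate an hourly money cap into an autoscaling instance cap.

    Rounds down so the resulting fleet can never cost more than the budget.
    """
    if budget_per_hour < 0:
        raise ValueError("budget must be non-negative")
    if price_per_instance_hour <= 0:
        raise ValueError("price must be positive")
    return math.floor(budget_per_hour / price_per_instance_hour)


# The numbers from the comment above: $1000/hour budget, $10/hour VM type.
print(max_instances(1000, 10))  # 100
```

The autoscaler then only has to enforce an integer instance count, which it already tracks in real time, instead of waiting on a billing counter. It still says nothing about egress, storage, or other services billed separately.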

Eh, suddenly turning off all services in your account because you hit your cap is just as much a DoS type event - just of your services, not your wallet.
So? Many would prefer a DoS-type event over spending $WHATEVER_THEIR_HARD_CAP_IS. This is kinda the definition of a hard cap, so you would place it sufficiently high that DoSing your system is indeed preferable.

Also, doing this on a per-service basis doesn't seem that far-fetched to me, so you'd only kill that service and get at least some chance that the rest of your system remains usable.

It’s the trade offs.

If you have an actual enforced cap, those services will be disabled until you resolve the cap - which depending on the latency for usage updates, may be hours after you pass the cap, and hours after you resolve the issue.

Or you have ‘warnings’, and your services keep working, but you spend more $$.

Previously, people seemed to be more worried about service outages than raw $$. Now it’s the other way around.

It’s a common issue with disk quotas in on-prem systems too, and they tend to cause a lot of similar types of problems in both directions.

Yeah, there's an implicit assumption of reasonability.

But a big part of the value in large clouds like GCP is the network's interconnectedness. Plus even if there was some global event that made communications impossible only for the billing service, I'd still expect charges to top out roughly proportional to the number of partitions as they each independently exceed the threshold. GCP only has 120ish zones.

It’s hard on AWS as well, but I agree. There’s just no incentive for the billing experience to be better.
AWS, GCP, Azure (the ones I work with) don't provide an off-the-shelf solution to block spending after some budget amount. This is not acceptable.
They charge for a lot of things "by the hour". Things like S3, load balancers, storage.

Deleting those when a customer hits a limit will lose customer data or remove things that might be hard to add back. The "I hit my AWS limit and they deleted all my data" headlines will result.

And excluding those things makes the limit soft again.

Maybe relying on one company to store all the data your company has is a terrible idea
I mean yes, look at Corey Quinn [1] for example. He has built an entire career out of the fact that cloud billing trips people up.

(Generally, tech seems to skate by on creating insanely complicated things, knowing that given enough pain, people will start blogging about their solutions, i.e. effectively outsourcing the cost and effort of doing something about it.)

[1] https://www.lastweekinaws.com/

Tech skates by on monopoly/oligopoly power. This arises because big players are allowed to buy competitors whenever they like. And since they are already monopolies/duopolies, they have unlimited money for such purposes. Killing off WhatsApp was chump change for Facebook.

We essentially don’t have monopoly enforcement in the US anymore.