| HN Mirror



	by lazide 54 days ago
	Depends on latency. 24 hour delays on an eventually consistent counter used for billing absolutely would cause this problem.


moring 54 days ago

It seems hard to believe that a one-hour delay on such a counter is impossible to achieve, and one hour would reduce the risk from "catastrophic" to "serious problem" in most cases.

Also, if implementing a cap is a desired feature that justifies trade-offs, then it is possible to translate the budget cap (in terms of money) back into service-specific caps that are easier to keep consistent. For example, "autoscale this set of VMs" combined with "my budget cap is $1000/hour", with the VM type priced at $10/hour, translates to "autoscale to at most 100 instances". That would need dev work (i.e. this feature being considered important) and would not automatically respect the budget cap across services, but it is still another piece of the puzzle.
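
To make the arithmetic concrete, here is a minimal sketch of that translation. The function name and parameters are hypothetical and do not correspond to any cloud provider's API; it assumes a single VM type with a flat hourly price.

```python
import math


def max_instances(budget_per_hour: float, price_per_instance_hour: float) -> int:
    """Translate an hourly money cap into an instance-count cap (hypothetical helper)."""
    if budget_per_hour < 0 or price_per_instance_hour <= 0:
        raise ValueError("budget must be non-negative and price must be positive")
    # Round down so the instance count never exceeds the budget.
    return math.floor(budget_per_hour / price_per_instance_hour)


# Numbers from the comment above: $1000/hour cap, $10/hour per VM.
print(max_instances(1000, 10))  # 100
```

The instance limit is a local, strongly consistent setting on the autoscaler, so it does not depend on how quickly the billing counter catches up.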


lazide 53 days ago

Eh, suddenly turning off all services in your account because you hit your cap is just as much a DoS type event - just of your services, not your wallet.


moring 52 days ago

So? Many would prefer a DoS-type event over spending $WHATEVER_THEIR_HARD_CAP_IS. This is kinda the definition of a hard cap, so you would place it sufficiently high that DoSing your system is indeed preferable.

Also, doing this on a per-service basis doesn't seem that far-fetched to me, so you'd only kill that service and get at least some chance that the rest of your system remains usable.


lazide 52 days ago

It’s the trade-offs.

If you have an actual enforced cap, those services will be disabled until you resolve the cap - which depending on the latency for usage updates, may be hours after you pass the cap, and hours after you resolve the issue.

Or you have ‘warnings’, and your services keep working, but you spend more $$.

Previously, people seemed to be more worried about service outages than raw $$. Now it’s the other way around.

It’s a common issue with disk quotas in on-prem systems too, and they tend to cause a lot of similar types of problems in both directions.


AlotOfReading 54 days ago

Yeah, there's an implicit assumption of reasonableness.

But a big part of the value in large clouds like GCP is the network's interconnectedness. Plus even if there was some global event that made communications impossible only for the billing service, I'd still expect charges to top out roughly proportional to the number of partitions as they each independently exceed the threshold. GCP only has 120ish zones.
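
As a rough sketch of that bound: if every partition enforces the cap on its own and cannot see the others, total spend is at most the cap times the number of partitions. The function name is hypothetical, and the $1000 cap is an assumed figure for illustration; only the roughly 120 zones comes from the comment.

```python
def worst_case_spend(cap: float, partitions: int) -> float:
    """Upper bound on total spend when each partition enforces the cap independently."""
    if cap < 0 or partitions < 1:
        raise ValueError("cap must be non-negative and partitions at least 1")
    # Each isolated partition stops at the cap, so the sum is bounded by cap * partitions.
    return cap * partitions


# Assumed $1000 cap, with one partition per zone (about 120 zones).
print(worst_case_spend(1000.0, 120))  # 120000.0
```

That is still a large overshoot, but it is bounded, unlike spend with no enforcement at all.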
