by dboreham 188 days ago
Typically they're overprovisioning because of prior experience. In the past something fell over because it didn't have enough memory, so they gave it more. That practice stuck in their mental model. Perhaps it's no longer valid, but who wants to bring down the service doing Chernobyl experiments?
3 comments

You run these tools, find the maximums over weeks of traffic, and set the limits down to match so you minimise cost. All is well until the event. The event itself doesn't really matter: it increases transaction processing time, and suddenly every service needs more memory to hold all the in-flight transactions. Instead they fail with out-of-memory errors and disappear, all your pods end up in restart loops, unable to cope, and you have an outage.

The company wasting 20% extra memory, on the other hand, is still selling and copes with the slower transaction speed just fine.

Not sure overprovisioning memory is really just waste when we have languages built on dynamic memory allocation, which is all modern languages outside real-time, safety-critical environments.
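
To put rough numbers on that failure mode, here is a minimal sketch using Little's law (in-flight requests = arrival rate × latency). Every figure in it (200 req/s, 2 MiB per in-flight transaction, a 300 MiB baseline, latency going from 100 ms to 150 ms) is a made-up assumption for illustration, not a measurement from any real service.

```python
def memory_needed_mib(requests_per_sec, latency_ms, mib_per_request, baseline_mib):
    """Estimate steady-state memory: baseline plus memory held by in-flight requests.

    Little's law: in-flight requests = arrival rate * time each request spends in the system.
    """
    in_flight = requests_per_sec * latency_ms / 1000
    return baseline_mib + in_flight * mib_per_request


def survives(limit_mib, needed_mib):
    """True if the container stays under its memory limit (no OOM kill)."""
    return needed_mib <= limit_mib


# Hypothetical service: 200 req/s, 2 MiB per in-flight transaction, 300 MiB baseline.
normal = memory_needed_mib(200, 100, 2, 300)  # weeks of "normal" traffic
event = memory_needed_mib(200, 150, 2, 300)   # same traffic, 50% slower transactions

tight_limit = normal * 1.05   # limit trimmed to just above the observed maximum
padded_limit = normal * 1.20  # the "wasteful" 20% headroom

print(f"normal: {normal:.0f} MiB, during event: {event:.0f} MiB")
print(f"tight limit survives event:  {survives(tight_limit, event)}")
print(f"padded limit survives event: {survives(padded_limit, event)}")
```

Under these assumptions the traffic rate never changes; a slowdown alone is enough to push the trimmed limit over the edge, while the padded one rides it out.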

This acts as if one app is the only app running on the device, which in the case of k8s clearly isn't true.

If you want to get scheduled on a node for execution after a node failure, your resource requests need to fit / pack somewhere.

The more accurately application limits are modeled, the better cluster packing gets.

This impacts cost.
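
A rough way to see the packing effect is a toy first-fit-decreasing bin packer. This is only a sketch: it is not how the real Kubernetes scheduler works, and the pod sizes, the 50% padding, and the 4096 MiB node size are all invented for the example.

```python
def nodes_needed(requests_mib, node_capacity_mib):
    """Count nodes used by first-fit decreasing: biggest pods first,
    each placed on the first node that still has room."""
    free = []  # remaining capacity on each node opened so far
    for req in sorted(requests_mib, reverse=True):
        if req > node_capacity_mib:
            raise ValueError(f"request of {req} MiB cannot fit on any node")
        for i, room in enumerate(free):
            if req <= room:
                free[i] = room - req
                break
        else:
            # No existing node has room, so open a new one.
            free.append(node_capacity_mib - req)
    return len(free)


# Hypothetical memory requests (MiB) sized to what the pods actually use.
accurate = [900, 700, 600, 500, 400, 300, 300, 300]
# The same pods with every request padded by 50%.
padded = [m * 3 // 2 for m in accurate]

print(nodes_needed(accurate, 4096))  # nodes needed with accurate requests
print(nodes_needed(padded, 4096))    # nodes needed with padded requests
```

With these made-up numbers the accurately sized requests fit on one node and the padded ones need two, which is where the cost difference comes from.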

> Perhaps it's no longer valid, but who wants to bring down the service

I'm thinking more like getting a junior to do efficiency passes on a year's worth of data.

Exactly, we call it sleep insurance. It is rational for the on-call engineer to pad the numbers, but irrational for the finance team to pay for it.