by jhoelzel 1376 days ago
In my experience, HPAs are awesome! Once you have defined your sweet spot of buffer pods for quick scaling, they are well worth the effort!

It's the super simple stuff, like scaling down staging on the weekend or even scaling all feature deployments to 0 when you know nobody will be working on them, that will end up saving you big bucks on your cloud budget.
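As a rough illustration of the weekend idea, here is a minimal Python sketch that only decides how many replicas a staging deployment should have at a given time. The function name, the weekday default of 3, and the Saturday/Sunday cutoff are all assumptions made up for the example; actually applying the number (a scheduled job, `kubectl scale`, or similar) is left out.

```python
from datetime import datetime

def staging_replicas(now, weekday_replicas=3):
    """Hypothetical policy: run staging on weekdays, scale to 0 on weekends."""
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    is_weekend = now.weekday() >= 5
    return 0 if is_weekend else weekday_replicas

if __name__ == "__main__":
    print(staging_replicas(datetime.now()))
```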

If you pair the HPA with a decent node autoscaler, THAT in my opinion is the game changer of cloud-managed Kubernetes over the bare-metal deployments that I have done.


I'm surprised to hear you say that you like having two layers of autoscaling rather than it being some accidental complexity that you just have to put up with because of how the different systems intersect. Having multiple non-orthogonal dimensions of scaling to me always feels like a task ill-suited to humans. I kinda wish HPA, VPA, and Cluster Autoscaler were all just one thing.
I think the boundary makes a lot of sense. Cluster autoscaling only responds to scheduling pressure; if there are pending pods, a new node is added to the cluster so those pods can run. Meanwhile, horizontal pod autoscaling is a totally different system; it adds pods for that service when system-level metrics indicate that it should. Vertical pod autoscaling is again mostly unrelated; if metrics indicate that a certain pod should be bigger, a bigger version is scheduled.
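For reference, the HPA side of this boils down to the ratio the Kubernetes docs give: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). A minimal Python sketch of just that ratio follows. The function name is made up, and the 0.1 tolerance is meant to mirror the documented default; the real controller also deals with missing metrics, pod readiness, and stabilization windows, none of which is modeled here.

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         tolerance=0.1):
    """Sketch of the HPA ratio: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    # If the metric is close enough to the target, leave the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    print(hpa_desired_replicas(4, 90, 60))
```

Note that nothing in this calculation knows about nodes. If the extra replicas do not fit, they simply go pending, and that is the signal the cluster autoscaler reacts to.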

I do see why more integration would be useful, though, including disruption budgets. Mostly for consolidating the incremental cluster autoscaling results onto one node from time to time, without waiting for the workload to naturally disappear or decrease in scale. Also, it would be nice to say "hey if ARM spot nodes are cheaper than AMD64, just reschedule these workloads onto ARM". Basically, it's still the very early days of optimizing cost, latency, and throughput.

The cluster autoscaler will do pod compaction. It would be nice to specify when to favor compaction over expansion because you know the traffic is going to fall off after a certain time during the day.

The main thing the integration helps with is reducing the startup time when there is scheduling pressure. If you know your increase in number of pods will always mean an expansion in the nodegroup, you can proactively and optimistically expand the nodegroup.

What's a case where you need to scale up pods but not nodes? (I think the case where you need to scale vertically vs horizontally is easily imagined, though)
Anything that horizontally scales. Remember that scheduling pressure doesn't only add nodes; it can also preempt lower-priority workloads.

So maybe you have an application server that uses 1 CPU and 1G of RAM per instance, and can handle 10,000 requests per second. If you are getting 30,000 requests per second, then you'll want at least 2 more replicas to handle the load with an acceptable response time. You also run fuzz tests in the background at a very low priority. So scheduling 2 more application server replicas will cause those jobs to be preempted, and give your application 3 CPUs and 3G of RAM.
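The numbers in that example can be worked through in a small Python sketch. It assumes a single 3-CPU node, pods described as `(name, cpus, priority)` tuples, and a toy preemption rule that evicts the lowest-priority pods first; all of these names and values are hypothetical. The real kube-scheduler also weighs disruption budgets, graceful termination, and more.

```python
import math

def replicas_needed(total_rps, per_replica_rps):
    """Replicas required to serve total_rps at per_replica_rps each."""
    return math.ceil(total_rps / per_replica_rps)

def schedule_with_preemption(node_cpus, running, new_pod):
    """Try to place new_pod on one node, evicting lower-priority pods if needed.

    Returns (running_after, evicted_names), or None if the pod cannot fit
    even after evicting every lower-priority pod.
    """
    _, cpus, priority = new_pod
    free = node_cpus - sum(c for _, c, _ in running)
    kept, evicted = list(running), []
    # Consider victims from lowest priority upward.
    for victim in sorted(running, key=lambda p: p[2]):
        if free >= cpus or victim[2] >= priority:
            break
        kept.remove(victim)
        evicted.append(victim[0])
        free += victim[1]
    if free < cpus:
        return None
    return kept + [new_pod], evicted

if __name__ == "__main__":
    needed = replicas_needed(30_000, 10_000)  # 3 replicas in total
    running = [("app-0", 1, 1000), ("fuzz-a", 1, 0), ("fuzz-b", 1, 0)]
    for i in range(1, needed):
        running, evicted = schedule_with_preemption(3, running, (f"app-{i}", 1, 1000))
        print("scheduled", f"app-{i}", "evicted", evicted)
```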

Basically, with this sort of autoscaling, you are always using 100% of your computers for something, but when there is some business to do or money to be made, you can give the revenue critical stuff priority.

As always, there are technicalities as to why you wouldn't want to do this. Maybe you think your fuzz testing is going to find a container escape and destroy the VM that it's running on, so you don't want production traffic anywhere near it. Or maybe the 10,000 requests per second that your application server can handle with 1 CPU actually uses all of the network capacity on the node, so you have to scale across other machines in order to handle any more requests. It all depends, but the flexibility is there to get yourself high utilization of your physical hardware.

If you have some spare capacity on another node.

This can happen organically due to scaling different systems at different times.

During the day you do a lot of user load, and during the night you do a lot of batch processing. As one workload scales down, another can scale up without needing more virtual machine instances.
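A quick way to see the saving is to compare the sum of each workload's peak with the peak of their combined load. The hourly CPU figures and the 8-CPU node size below are invented purely for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

# Hypothetical CPU demand per hour of the day (index 0..23).
user = [40 if 8 <= h < 20 else 10 for h in range(24)]   # busy during the day
batch = [5 if 8 <= h < 20 else 40 for h in range(24)]   # busy at night

sum_of_peaks = max(user) + max(batch)                   # sized separately
peak_of_sum = max(u + b for u, b in zip(user, batch))   # sharing one cluster

CPUS_PER_NODE = 8
nodes_separate = ceil_div(sum_of_peaks, CPUS_PER_NODE)
nodes_shared = ceil_div(peak_of_sum, CPUS_PER_NODE)

if __name__ == "__main__":
    print(nodes_separate, nodes_shared)
```

With these made-up numbers, sizing the two workloads separately needs 10 nodes, while letting them share the cluster peaks at 7.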

Of course, a complete analysis needs something like kubecost running for a couple of weeks to determine where your peaks really are, and a couple more for actual fine-tuning, but I think it's well worth it in the end.

Node autoscaling works best for me with buffer nodes, depending on resources, and having "one more than you need" is super easy in the cloud.

Don't get me wrong, there is still plenty of room for improvement, but the hard part is definitely just finding out how many resources your app really needs.

And of course, the application needs to be able to handle scaling to begin with.

It's more reliable this way. N number of pods are scheduled on M number of nodes, and if there are multiple sets of pods that each have their own scaling parameters (target utilizations, scaling cooldowns, etc), there is not always a one-to-one mapping with how many nodes are needed.
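To make the "no one-to-one mapping" point concrete, here is a small Python sketch that packs pods onto nodes with first-fit decreasing bin packing. That heuristic is a stand-in chosen for the example, not what the scheduler or cluster autoscaler actually does, and the pod sizes and 4-CPU node size are made up.

```python
def nodes_needed(pod_cpu_requests, node_cpu):
    """Count nodes used by first-fit decreasing packing of CPU requests."""
    free = []  # remaining CPU on each node opened so far
    for req in sorted(pod_cpu_requests, reverse=True):
        if req > node_cpu:
            raise ValueError("pod is larger than a whole node")
        for i, remaining in enumerate(free):
            if remaining >= req:
                free[i] -= req
                break
        else:
            # No existing node has room, so open a new one.
            free.append(node_cpu - req)
    return len(free)

if __name__ == "__main__":
    # Deployment A uses 2-CPU pods, deployment B uses 1-CPU pods.
    print(nodes_needed([2] * 3 + [1] * 4, 4))  # baseline
    print(nodes_needed([2] * 4 + [1] * 4, 4))  # A scales up by one pod
    print(nodes_needed([2] * 4 + [1] * 5, 4))  # then B scales up by one pod
```

In this toy setup, the first scale-up fits into existing slack and adds no node, while the second one does.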

The cluster autoscaler already has fairly complex logic just in its own control loop. It uses predicate logic and a simulated scheduler to determine, based upon a pending pod's node selector, affinity, anti-affinity, taints, tolerations, QoS, and priority, whether expanding a nodegroup would make the pod schedulable.
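A heavily simplified Python sketch of that kind of check is below. It only looks at node selectors, taints and tolerations (reduced to bare keys), and a CPU request; the class and function names are invented for the example and do not correspond to the cluster autoscaler's actual code, which runs real scheduler predicates against a template node.

```python
from dataclasses import dataclass, field

@dataclass
class NodeTemplate:
    """What a fresh node from a nodegroup would look like (hypothetical)."""
    name: str
    labels: dict
    taints: set            # taint keys only, effects are ignored here
    allocatable_cpu: int

@dataclass
class Pod:
    name: str
    cpu_request: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)

def would_fit(pod, template):
    """Would the pending pod be schedulable on a new node of this shape?"""
    selector_ok = all(template.labels.get(k) == v
                      for k, v in pod.node_selector.items())
    taints_ok = template.taints <= pod.tolerations  # every taint is tolerated
    resources_ok = pod.cpu_request <= template.allocatable_cpu
    return selector_ok and taints_ok and resources_ok

def groups_worth_expanding(pod, templates):
    """Nodegroups whose expansion would make the pending pod schedulable."""
    return [t.name for t in templates if would_fit(pod, t)]

if __name__ == "__main__":
    groups = [
        NodeTemplate("general", {"arch": "amd64"}, set(), 4),
        NodeTemplate("gpu", {"arch": "amd64", "accel": "gpu"}, {"gpu"}, 8),
        NodeTemplate("arm", {"arch": "arm64"}, set(), 4),
    ]
    pending = Pod("api", 2, node_selector={"arch": "arm64"})
    print(groups_worth_expanding(pending, groups))
```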

So it's actually easier (at least for me) to reason out what might happen, with two control loops that work independently in adjacent dimensions than a single one that tries to cover everything. I would not want HPA, VPA, and cluster-autoscaler to be one thing.

I have never used VPA, and in our use case we do a different kind of vertical scaling: different deployments that target different nodegroups with different numbers of cores on the base machine.

Look at Karpenter. It's working really well for some of our workloads.
> I kinda wish HPA, VPA, and Cluster Autoscaler were all just one thing.

Go and write it, there are a bajillion open source controllers for Kubernetes that add a ton of value.

> Having multiple non-orthogonal dimensions of scaling to me always feels like a task ill-suited to humans.

It's ill-suited for anyone unless P=NP with a nice solution.

> Having multiple non-orthogonal dimensions of scaling to me always feels like a task ill-suited to humans.

That plus a steep learning curve leads to Stockholm syndrome.

That's what we did where I work. Thanks to GKE + HPA + cluster-autoscaler, our cluster grows at the same time as our req/s.
Same.

We broke it at least once but now it’s fixed.