What Is Kubernetes HPA and How Can It Help You Save on the Cloud? (cast.ai)
35 points by deletriusotis 1376 days ago
5 comments

In my experience, HPAs are awesome! Once you have defined your sweet spot of buffer pods for quick scaling, they are well worth the effort!

It's the super simple stuff, like scaling down staging on the weekend or even scaling all feature deployments to 0 when you know nobody will be working on them, that will end up saving you big bucks on your cloud budget.
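
A minimal sketch of the weekend rule, assuming a hypothetical helper whose result you would feed to whatever actually sets the replica count (a scheduled job, for example). The function name and the default of 2 weekday replicas are made up for illustration.

```python
from datetime import datetime


def desired_staging_replicas(now: datetime, weekday_replicas: int = 2) -> int:
    """Return 0 on Saturday and Sunday, otherwise the normal weekday count."""
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    if now.weekday() >= 5:
        return 0
    return weekday_replicas
```

The same idea extends to feature deployments: anything nobody is working on gets a target of 0.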

If you pair the HPA with a decent node autoscaler, THAT, in my opinion, is the game changer of cloud-managed Kubernetes over the bare-metal deployments that I have done.

I'm surprised to hear you say that you like having two layers of autoscaling rather than it being some accidental complexity that you just have to put up with because of how the different systems intersect. Having multiple non-orthogonal dimensions of scaling to me always feels like a task ill-suited to humans. I kinda wish HPA, VPA, and Cluster Autoscaler were all just one thing.
I think the boundary makes a lot of sense. Cluster autoscaling only responds to scheduling pressure; if there are pending pods, a new node is added to the cluster so those pods can run. Meanwhile, horizontal pod autoscaling is a totally different system; it adds pods for that service when system-level metrics indicate that it should. Vertical pod autoscaling is again mostly unrelated; if metrics indicate that a certain pod should be bigger, a bigger version is scheduled.
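
For the HPA half of that, the core of the documented algorithm is one ratio. A rough sketch, deliberately ignoring the tolerance band, min/max replica bounds, and pod readiness handling that the real controller applies (the function name is mine):

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


# 4 pods averaging 90% CPU against a 60% target -> 6 pods.
print(hpa_desired_replicas(4, 90, 60))
```

Note that nothing in it knows about nodes. If the extra pods don't fit, they go pending, and that is the only signal the cluster autoscaler reacts to.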

I do see why more integration would be useful, though, including disruption budgets. Mostly for consolidating the incremental cluster autoscaling results onto one node from time to time, without waiting for the workload to naturally disappear or decrease in scale. Also, it would be nice to say "hey if ARM spot nodes are cheaper than AMD64, just reschedule these workloads onto ARM". Basically, it's still the very early days of optimizing cost, latency, and throughput.

The cluster autoscaler will do pod compaction. It would be nice to be able to specify when to favor compaction over expansion, because you know the traffic is going to fall off after a certain time of day.

The main thing the integration helps with is reducing the startup time when there is scheduling pressure. If you know your increase in number of pods will always mean an expansion in the nodegroup, you can proactively and optimistically expand the nodegroup.

What's a case where you need to scale up pods but not nodes? (I think the case where you need to scale vertically vs horizontally is easily imagined, though)
Anything that horizontally scales. Remember that scheduling pressure doesn't only add nodes, you can also preempt lower priority workloads.

So maybe you have an application server that uses 1 CPU and 1G of RAM per instance, and can handle 10,000 requests per second. If you are getting 30,000 requests per second, then you'll want at least 2 more replicas to handle the load with an acceptable response time. You also run fuzz tests in the background at a very low priority. So scheduling 2 more application server replicas will cause those jobs to be preempted, and give your application 3 CPUs and 3G of RAM.
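
The arithmetic, spelled out as a tiny sketch (the function and variable names are just for illustration):

```python
import math


def replicas_needed(load_rps: int, capacity_rps_per_replica: int) -> int:
    """Smallest replica count whose combined capacity covers the load."""
    return math.ceil(load_rps / capacity_rps_per_replica)


total = replicas_needed(30_000, 10_000)   # 3 replicas in total
extra = total - 1                          # 2 more than the one already running
cpus_needed = total * 1                    # 1 CPU per replica -> 3 CPUs
ram_gib_needed = total * 1                 # 1G per replica -> 3G
```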

Basically, with this sort of autoscaling, you are always using 100% of your computers for something, but when there is some business to do or money to be made, you can give the revenue critical stuff priority.

As always, there are technicalities as to why you wouldn't want to do this. Maybe you think your fuzz testing is going to find a container escape and destroy the VM that it's running on, so you don't want production traffic anywhere near it. Or maybe the 10,000 requests per second that your application server can handle with 1 CPU actually uses all of the network capacity on the node, so you have to scale across other machines in order to handle any more requests. It all depends, but the flexibility is there to get yourself high utilization of your physical hardware.

If you have some spare capacity on another node.

This can happen organically due to scaling different systems at different times.

During the day you handle a lot of user load, and during the night you do a lot of batch processing. As one workload scales down, another can scale up, without needing more virtual machine instances.
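
A toy illustration with invented hourly CPU demand (in cores) for two workloads. None of these numbers come from a real cluster; the point is only that the peak of the sum can be much lower than the sum of the peaks.

```python
# Hypothetical demand per hour of the day, 24 entries each.
user_load = [2] * 8 + [10] * 12 + [2] * 4    # busy roughly 08:00-20:00
batch_load = [9] * 8 + [1] * 12 + [9] * 4    # busy overnight


def peak_if_separate(a, b):
    """Capacity needed if each workload gets its own dedicated nodes."""
    return max(a) + max(b)


def peak_if_shared(a, b):
    """Capacity needed if both workloads share the same pool of nodes."""
    return max(x + y for x, y in zip(a, b))


print(peak_if_separate(user_load, batch_load))  # 19 cores
print(peak_if_shared(user_load, batch_load))    # 11 cores
```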

Of course, a complete analysis needs something like kubecost running for a couple of weeks to determine where your peaks really are, and a couple more for actual fine-tuning, but I think it's well worth it in the end.

Node autoscaling works best for me with buffer nodes, sized depending on resources, and having "one more than you need" is super easy in the cloud.

Don't get me wrong, there is still plenty of room for improvement, but the hard part is definitely just finding out how many resources your app really needs.

And of course, the application needs to be able to handle scaling to begin with.

It's more reliable this way. N number of pods are scheduled on M number of nodes, and if there are multiple sets of pods that each have their own scaling parameters (target utilizations, scaling cooldowns, etc), there is not always a one-to-one mapping with how many nodes are needed.

The cluster autoscaler already has fairly complex logic just in its own control loop. It uses predicate logic and a simulated scheduler to determine, based upon a pending pod's node selector, affinity, anti-affinity, taints, tolerations, QoS, and priority, whether expanding a nodegroup would make the pod schedulable.
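
To give a feel for it, here is a heavily simplified toy version of that "would a new node from this nodegroup help?" check. This is not the cluster autoscaler's code: all class and function names are hypothetical, taints are reduced to bare keys, and affinity, anti-affinity, QoS, and priority are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class NodeTemplate:
    """What a fresh node from a nodegroup would look like (toy model)."""
    labels: dict
    taints: set            # taint keys only, a big simplification
    free_cpu: float
    free_mem_gib: float


@dataclass
class Pod:
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)
    cpu_request: float = 0.0
    mem_request_gib: float = 0.0


def would_fit(pod: Pod, node: NodeTemplate) -> bool:
    """Run a few scheduler-style predicates against a simulated new node."""
    selector_ok = all(node.labels.get(k) == v
                      for k, v in pod.node_selector.items())
    taints_ok = node.taints <= pod.tolerations   # every taint is tolerated
    resources_ok = (pod.cpu_request <= node.free_cpu
                    and pod.mem_request_gib <= node.free_mem_gib)
    return selector_ok and taints_ok and resources_ok


def nodegroups_worth_expanding(pod: Pod, templates: dict) -> list:
    """Names of nodegroups whose new node would make the pending pod schedulable."""
    return [name for name, tmpl in templates.items() if would_fit(pod, tmpl)]
```

Even this cut-down version has three interacting predicates, which is part of why I'd rather not fold HPA decisions into the same loop.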

So it's actually easier (at least for me) to reason out what might happen, with two control loops that work independently in adjacent dimensions than a single one that tries to cover everything. I would not want HPA, VPA, and cluster-autoscaler to be one thing.

I have never used VPA, and in our use case, we do a different kind of vertical scaling. (Different deployments that target different nodegroups with different numbers of cores on the base machine.)

Look at Karpenter. It's working really well for some of our workloads.
> I kinda wish HPA, VPA, and Cluster Autoscaler were all just one thing.

Go and write it, there are a bajillion open source controllers for Kubernetes that add a ton of value.

> Having multiple non-orthogonal dimensions of scaling to me always feels like a task ill-suited to humans.

It's ill-suited for anyone unless P=NP with a nice solution.

> Having multiple non-orthogonal dimensions of scaling to me always feels like a task ill-suited to humans.

That plus a steep learning curve leads to Stockholm syndrome.

That's what we did where I work. Thanks to GKE + HPA + cluster-autoscaler, our cluster grows at the same time as our req/s.
Same.

We broke it at least once but now it’s fixed.

HPAs can definitely save you a lot of money when running Kubernetes and they are extremely useful, especially for non-production environments where you want to be as efficient as possible.

Strategies I have used in the past for saving money are:

  1) Set requests very low for your pods. Look at the minimum CPU/Memory that your pods need to start and set it to that. Limits can be whatever.

  2) Set min replicas to 1. This is a non-production environment, nobody cares if an idle pod goes away in the middle of the night.

  3) Use spot instances for your cluster nodes. 80% savings is nice!

  4) Increase the number of allowed pods per node. GKE sets the default to 110 pods per node but it can be increased.

  5) Evaluate your nodes and determine if it makes more sense to have `fewer large sized nodes` or `several smaller nodes`. If you have a lot of daemonsets then maybe it makes sense to have fewer large nodes.

  6) Look at the CPU and Memory utilization of your nodes. Are you using a lot of CPU but not much memory?  Maybe you need to change the machine type you are using so that you get close(r) to 100% CPU and Memory utilization. You are just wasting money if you are only using 50% of the available memory of your nodes.

  7) Use something like knative or KEDA for 'intelligent autoscaling'. I've used both extensively and I found KEDA to be considerably simpler to use. Being able to scale services down to 0 pods is extremely nice!
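
For point 5, a back-of-the-envelope sketch of why daemonsets push you toward larger nodes. The node sizes and the 0.5 cores of daemonset overhead per node are made-up example values, and the function name is hypothetical.

```python
def daemonset_overhead_fraction(node_count: int,
                                cores_per_node: float,
                                daemonset_cores_per_node: float) -> float:
    """Share of total cluster CPU consumed by per-node daemonsets."""
    total_cores = node_count * cores_per_node
    overhead_cores = node_count * daemonset_cores_per_node
    return overhead_cores / total_cores


# Same 32 cores either way, assuming daemonsets cost 0.5 cores per node.
print(daemonset_overhead_fraction(8, 4, 0.5))    # 0.125   (8 small nodes)
print(daemonset_overhead_fraction(2, 16, 0.5))   # 0.03125 (2 large nodes)
```

The flip side, which this ignores, is that fewer large nodes means a bigger blast radius when one of them goes away (think spot reclaims).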
> Set requests very low for your pods. Look at the minimum CPU/Memory that your pods need to start and set it to that. Limits can be whatever.

Wouldn't this lead to node overprovisioning?

I ask because my company's workload is very spiky and usage is very minimal until it isn't. We are looking into ways to optimize it.

My example was for non-prod and saving money there, as I found that our development clusters tended to be the most underutilized per dollar spent. In development it was OK to put as many idle pods as possible on the nodes. If there was a spike, then yes, you could get new nodes, but I found that the cluster scaled nodes back down quite often.

My apologies in advance as the advice can be terrible depending on your environment and services. Below is not an exact science as you are dealing with requests and limits while trying to find optimal performance.

For production you need to calculate your minimum, average and max CPU/Memory for a pod.

  1) Set your replicas to 1

  2) Determine what your true maximum CPU/Memory is for a pod. 

  Set your limits to very high and performance test against your pod. If your response time slows to a crawl then your limit is too high and your code may not be able to handle the load. If your response time is good while hitting the limit, increase the limit until performance goes down.

  3) Get your minimum CPU/Memory for your pod to start.

  5) Get your average CPU/Memory DURING THE SPIKES. You should be able to get this from past metrics. This can also be difficult to get because your load might be spread over several pods in your metrics.

  6) I use the following formulas:

     requests = (min + average)/2
     limits = (average + max)/2

  7) You now have a baseline for the future so that you can tweak the values.

  8) Set your autoscaler to something high like 80% CPU. You want this value to stay constant. I think GKE sets it to 60% but I found that to be far too low and wasteful.

  9) Observe and tweak the values to see if you can get things 'better' depending on your needs.
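
The two formulas from step 6 as a small sketch. The function name and the example numbers (CPU in millicores) are illustrative only; apply it once for CPU and once for memory.

```python
def baseline_requests_and_limits(minimum: float,
                                 average: float,
                                 maximum: float) -> tuple:
    """requests = (min + average) / 2, limits = (average + max) / 2."""
    requests = (minimum + average) / 2
    limits = (average + maximum) / 2
    return requests, limits


# Needs 100m to start, averages 500m during spikes, tops out at 1500m.
print(baseline_requests_and_limits(100, 500, 1500))   # (300.0, 1000.0)
```

Treat the output as a starting point for step 9, not as a final answer.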

There are two other things I always do in production that help with stability and reliability.

  - Set the autoscaler behavior to scale up quickly and scale down slowly. It stops the chaotic cycles of add 3 pods, remove 1 pod, add 2 pods, remove 3 pods in short periods of time during spikes. The behavior field was added to the autoscaler resource a couple releases ago.

  - Set your minimum replicas to 2 for redundancy. I always do this in production.
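
A rough simulation of what "scale up quickly, scale down slowly" does, modeled loosely on the HPA's scale-down stabilization window (`behavior.scaleDown.stabilizationWindowSeconds`), where a scale-down only goes as low as the highest recommendation seen recently. This is a sketch, not the controller's code: the class name is made up and the window here counts control-loop ticks rather than seconds.

```python
from collections import deque


class SlowScaleDown:
    def __init__(self, window: int):
        # Remember only the last `window` raw recommendations.
        self.recent = deque(maxlen=window)

    def decide(self, current_replicas: int, recommended: int) -> int:
        self.recent.append(recommended)
        if recommended >= current_replicas:
            return recommended                  # scale up immediately
        # Scale down only as far as the highest recent recommendation.
        return min(current_replicas, max(self.recent))


def simulate(recommendations, start_replicas: int, window: int) -> list:
    policy = SlowScaleDown(window)
    replicas = start_replicas
    history = []
    for rec in recommendations:
        replicas = policy.decide(replicas, rec)
        history.append(replicas)
    return history


# A short spike: raw recommendations jump to 6, then drop right back to 3.
print(simulate([3, 6, 4, 3, 3, 3], start_replicas=3, window=3))
# [3, 6, 6, 6, 4, 3]
```

The spike is followed right away on the way up, while the replica count only steps down once the high recommendations have aged out of the window.
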
I hope this helps and I apologize once again for the hand-waviness of things.
Highly recommend checking out KEDA (https://keda.sh/) which leverages HPAs under the hood.

If you need to scale based on some internal data like database records, Redis queues, Kafka topics, etc. KEDA scalers are incredibly easy to hook up to do that. You could even write your own custom scaler if there is no existing one for your type of event data source.
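
To make the queue case concrete, here is roughly the shape of the calculation for a queue-length trigger, as I understand it: scale to zero when there is nothing to do, otherwise aim for a fixed number of items per replica. This is an approximation for illustration, not KEDA's actual code, and the function and parameter names are hypothetical.

```python
import math


def replicas_for_queue(queue_length: int,
                       target_per_replica: int,
                       max_replicas: int) -> int:
    """Approximate replica count for a queue-driven workload."""
    if queue_length == 0:
        return 0                                # nothing queued: scale to zero
    wanted = math.ceil(queue_length / target_per_replica)
    return min(max_replicas, wanted)            # never exceed the configured cap


print(replicas_for_queue(12, 5, 10))    # 3
print(replicas_for_queue(0, 5, 10))     # 0
```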

If the author is here: The illustration in "How does Horizontal Pod Autoscaler work?" section has incorrect before/after CPU utilization % based on the text/logic.
I'm curious if anyone has found a sweet spot for autoscaling ingress gateways in terms of CPU% saturation. I found tail latencies start to get high over 60%.