by hinkley 2220 days ago
> You can't expect kubernetes scheduler to enforce anti-affinities for your pods. You have to define them explicitly.

Why isn't this the default behavior? Why don't I have to go in and tell it that it's okay to have multiple instances on the same node? Why? So that I somehow feel like I've contributed to the whole process by fixing something that never should break in the first place?

I know of a few pieces of code where I definitely want to run N copies on one machine, but for all of the rest? Why am I even running 2 copies if they're just going to compete for resources?

3 comments

Simply put, what the article recommends is, in most situations, dead wrong.

If you configure things as suggested and for some reason lose enough nodes that you no longer have one node per replica, you will end up with fewer replicas than you intended, since the scheduler can't place a pod on a node where an identical pod is already running.
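
To make that failure mode concrete, here is a toy simulation. It is not the Kubernetes scheduler; the `schedule` function, the node names, and the least-loaded tie-breaking are all assumptions made up for illustration. It only shows that a hard one-per-node rule leaves replicas pending once replicas outnumber nodes, while a soft spreading preference still places them.

```python
def schedule(replicas, nodes, hard_anti_affinity):
    """Toy placement loop (hypothetical, not real scheduler logic).

    Returns (placements, pending): pods per node and the number of
    replicas that could not be placed anywhere.
    """
    placements = {node: 0 for node in nodes}
    pending = 0
    for _ in range(replicas):
        if hard_anti_affinity:
            # Hard rule: a node already running this pod is not a candidate.
            candidates = [n for n in nodes if placements[n] == 0]
        else:
            # Soft rule: every node stays a candidate.
            candidates = list(nodes)
        if not candidates:
            pending += 1
            continue
        # Prefer the least-loaded candidate; ties go to the earliest node.
        target = min(candidates, key=lambda n: placements[n])
        placements[target] += 1
    return placements, pending


nodes = ["node-a", "node-b", "node-c"]  # e.g. what is left after losing nodes
hard_placements, hard_pending = schedule(5, nodes, hard_anti_affinity=True)
soft_placements, soft_pending = schedule(5, nodes, hard_anti_affinity=False)
print(hard_placements, hard_pending)
print(soft_placements, soft_pending)
```

With 5 replicas on 3 surviving nodes, the hard rule runs 3 and leaves 2 pending; the soft rule runs all 5, doubling up on two nodes.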

Additionally, the affinity and anti-affinity features are costly from the cluster's perspective, so the configuration the author recommends costs you performance.

And why isn't Kubernetes doing the obvious thing and spreading your pods apart? Well, it's simple: it already does the right thing:

https://v1-15.docs.kubernetes.io/docs/concepts/scheduling/ku...

The first scoring parameter is SelectorSpreadPriority.
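
Roughly, a spreading score means nodes that already run more pods of the same service get a lower score, so the scheduler prefers to spread but can still double up when it has to. The sketch below is a simplified illustration of that idea, not the actual Kubernetes code; the `spread_scores` name, the 0 to 10 scale, and the exact formula are assumptions.

```python
MAX_PRIORITY = 10  # assumed score ceiling for this sketch


def spread_scores(matching_pods_per_node):
    """Score each node: fewer matching pods already present -> higher score.

    This is a preference, not a filter: a node with a low score can still
    be chosen if nothing better is available.
    """
    max_count = max(matching_pods_per_node.values(), default=0)
    scores = {}
    for node, count in matching_pods_per_node.items():
        if max_count == 0:
            # No matching pods anywhere: every node is equally good.
            scores[node] = MAX_PRIORITY
        else:
            scores[node] = MAX_PRIORITY * (max_count - count) // max_count
    return scores


scores = spread_scores({"node-a": 2, "node-b": 1, "node-c": 0})
print(scores)
```

Here the empty node scores highest and the node already holding two replicas scores lowest, but none of them is ruled out.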

It's quite possible that you have a machine with 192 CPU cores in it, but it's very unlikely that you are able to write a service that scales to that level ... and if you write it in Go it's really unlikely that you can scale even to 8 CPUs. There's nothing weird about having multiple replicas of the same job on the same node. If you look through the Borg traces that Google recently published you can find lots of jobs with multiple replicas per node.
This is not how defaults work.

When you are talking about the realm of the possible, you provide settings that allow you to reach the scenarios that you feel are reasonable, desirable, or lucrative (or commonly enough, some happy combination of the three).

Defaults are the realm of the probable. And nobody is requisitioning a 192 core machine without a good bit of due diligence, which would include deciding how to set server affinity.

You're suggesting that preventing multiple replicas of the same job from being scheduled on the same machine is a good default. There's no evidence to support your conclusion, and my experience is quite the opposite. It is much better if people running batch jobs just schedule 100000 tiny replicas, and let the scheduler sort it out. This provides the cluster scheduler with plenty of liquidity. Multiple small processes are more efficient than a shared-nothing single process.
Still the same question.

Do you think that batch processing is the default activity in Kubernetes, or something that people find after they are familiar with the system?

Yes, I think batch workloads are the most common workloads, in resource-weighted terms, among k8s users.
You're being slippery, which comes across as dishonest.

Why does the resource weight have anything to do with the choice of defaults? Settings don't care how often they are read, they only care how often they are set. Large jobs use a disproportionate amount of total resources, sure, but they are a tiny uptick in total configuration.

The stakes are higher, but so is the 'budget' for getting things right. I can deploy 5 servers and just wait to see what happens. If I'm doing an overnight job to process a billion records, I'd better be doing some due diligence beforehand, or I have nobody to blame but me. And the failure mode here is that I didn't spend money fast enough to get the job done.

With the current defaults what happens is I blow my monthly budget in one night. Which is very convenient for the vendor, but not convenient for my company.

"It is difficult to get a man to understand something when his salary depends upon his not understanding it." - Upton Sinclair

Pod anti-affinities did historically increase scheduling times dramatically. Not sure this is the primary reason, but it's probably one.