by sm_ts 1666 days ago
I've maintained a QEMU fork with pinning support and even coauthored a research paper on Linux pinning performance, and the results have been... underwhelming; "sadly", the Linux kernel does a pretty good job at scheduling :)

I advise pinning users to carefully measure the supposed performance improvement, as there is a tangible risk of spending time on imaginary gains.
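For anyone who wants to follow that advice, here is a minimal sketch of a before/after measurement. It assumes Linux (`os.sched_setaffinity` is Linux-only) and pins the calling process rather than a VM's vCPU threads; `busy_work`, `time_runs`, `compare_pinned` and `summarize` are hypothetical names made up for this illustration, and the toy workload stands in for whatever you actually care about.

```python
import os
import statistics
import time


def busy_work(n=200_000):
    # Placeholder CPU-bound workload; swap in the real thing you want to measure.
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_runs(workload, repeats=5):
    # Wall-clock time of each run, in seconds.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def compare_pinned(workload, cpu, repeats=5):
    # Time the workload with the current affinity, then pinned to one CPU.
    # The original affinity mask is restored even if the workload raises.
    original = os.sched_getaffinity(0)
    unpinned = time_runs(workload, repeats)
    try:
        os.sched_setaffinity(0, {cpu})
        pinned = time_runs(workload, repeats)
    finally:
        os.sched_setaffinity(0, original)
    return unpinned, pinned


def summarize(samples):
    # Report spread and worst case as well as the median, since pinning may
    # change latency consistency more than average throughput.
    return {
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
        "max": max(samples),
    }
```

Something like `unpinned, pinned = compare_pinned(busy_work, cpu=2)` followed by `summarize(...)` on each list gives a first impression; a handful of samples on an otherwise idle machine is not evidence of much, so repeat it under realistic host load.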

7 comments

I found the most gains in terms of... latency consistency. I had a VM with a GPU passed through for gaming. With the cores appropriately pinned, especially away from host tasks, there were no more random DPC latency spikes.

With no pinning they'd randomly go into the milliseconds -- with pinning it would stay in the microsecond range!

The result of this is games (and likely audio) performing much more favorably.

How much of this is cache coherency/in-fighting, scheduling, or simply host usage, I couldn't tell you. I was just happy to have my VM 'feel' native.

There will always be a benefit with pinning vCPUs on the same NUMA nodes as their devices (VFIO or even SR-IOV). This is becoming increasingly important on hypervisors.
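A rough sketch of how one might look up which host CPUs are local to a passed-through device, assuming a Linux host that exposes the usual sysfs files (`/sys/bus/pci/devices/<address>/numa_node` and `/sys/devices/system/node/node<N>/cpulist`). The function names are hypothetical, and the `sysfs` parameter exists only so the lookup can be pointed at a test directory.

```python
from pathlib import Path


def parse_cpu_list(text):
    # Parse the kernel's cpulist format, e.g. "0-3,8,10-11".
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def device_numa_node(pci_addr, sysfs=Path("/sys")):
    # The kernel reports -1 when it has no NUMA information for the device.
    raw = (sysfs / "bus/pci/devices" / pci_addr / "numa_node").read_text()
    node = int(raw)
    return None if node < 0 else node


def node_cpus(node, sysfs=Path("/sys")):
    # Logical CPUs that belong to the given NUMA node.
    path = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpu_list(path.read_text())


def local_cpus_for_device(pci_addr, sysfs=Path("/sys")):
    # Candidate CPUs for vCPU pinning; empty list if the node is unknown.
    node = device_numa_node(pci_addr, sysfs)
    return [] if node is None else node_cpus(node, sysfs)
```

With a made-up address such as `0000:41:00.0`, `local_cpus_for_device("0000:41:00.0")` would return the CPUs on that device's node, which is the set you would then choose vCPU pins from.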

In a setup with a high level of container colocation on large EC2 instances, we've seen the opposite behavior at Netflix: default CFS performing badly. We've A/B tested our flavor of custom pinning and measured substantial benefits: https://netflixtechblog.com/predictive-cpu-isolation-of-cont...

PMC data at scale is pretty clear: very often, CFS won't do the right thing and will leave bad HT neighbors on the same core, leading to L1 thrashing, or keep a high level of imbalance between NUMA sockets, leading to a degraded LLC hit rate.
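To make the "HT neighbors on the same core" point concrete, here is a small sketch that groups logical CPUs by physical core and flags tasks whose pins land on the same core. It assumes the Linux sysfs topology files `physical_package_id` and `core_id` under `/sys/devices/system/cpu/cpuN/topology/`; the function names and the shape of the `pinning` dict (task name to logical CPU) are invented for this example, and it is not the placement logic from the linked post.

```python
from collections import defaultdict
from pathlib import Path


def read_core_map(sysfs=Path("/sys")):
    # Map each logical CPU number to its (package_id, core_id) pair.
    # Hyperthread siblings share the same pair.
    core_of = {}
    for topo in (sysfs / "devices/system/cpu").glob("cpu[0-9]*/topology"):
        cpu = int(topo.parent.name[3:])  # strip the "cpu" prefix
        package = int((topo / "physical_package_id").read_text())
        core = int((topo / "core_id").read_text())
        core_of[cpu] = (package, core)
    return core_of


def shared_cores(pinning, core_of):
    # Return {(package, core): [tasks]} for every physical core that has
    # more than one pinned task on it.
    by_core = defaultdict(list)
    for task, cpu in pinning.items():
        by_core[core_of[cpu]].append(task)
    return {
        core: sorted(tasks)
        for core, tasks in by_core.items()
        if len(tasks) > 1
    }
```

Whether two tasks sharing a core is actually harmful depends on what they do; this only tells you where to point the performance counters.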

Thanks, that's a very interesting case.

I correct my statement with "_did_ a good job", and appreciate rigorous testing.

Not sure how you maintaining QEMU makes you a credible source for evaluating a scheduler's performance. It's apparent to me the performance of the scheduler is a function of the workload, so YMMV.

I worked on a project where we collected detailed production runtime characteristics and evaluated scheduler algorithms against it. Tiny improvements made for massive savings.

I definitely correct my "does" a good job with "did" a good job. But ultimately, I've advised a good deal of caution, which I think is fair, in particular considering that only a small fraction of companies have a compute scale where tiny improvements make massive savings.

At my last job we initially saw performance loss due to pinning; I think multiple QEMU I/O threads got pinned to a single CPU. It's very easy to do it wrong.

I have looked around a bit: it's complicated to get right, and most people doing it for gaming report very slight performance gains.

YMMV. We've seen M$ worth of cloud savings at Netflix doing pinning right. Knowing that the task scheduler is also heavily forked in Google's kernel, I'm ready to bet they've seen an order of magnitude higher savings in their own DCs as well.

Agreed, in my case it became very useful on large boxes (96 physical cores). The performance gain was about 10%.

Would you mind sharing the paper on pinning? I'd be interested.

Hello! I'll write you via email.