by PaulDavisThe1st 1733 days ago
Although we are still debating with some smart people about this (since actual measurements call it into question), this was our (Ardour dev team's) take on this question:

https://ardour.org/plugins-in-process.html


The answer is "too much overhead", but the overhead isn't coming from where I assumed it would. I thought it would be too expensive to pass the required amount of data through the kernel (at least 48000 samples per track per second), but that's not the problem; it turns out it's the context switches. Huh.

edit: I now also remembered that virtual memory is a thing and you can share a chunk of physical memory between processes to avoid the need to copy anything at all.
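
A minimal sketch of that idea, assuming a Unix-like system and Python: the parent creates an anonymous shared mapping, forks, and the child halves the samples in place, so the parent sees the result without any sample data passing through a pipe or socket. The function name `halve_in_child` and the gain of 0.5 are hypothetical, chosen only for illustration. A real plugin host would more likely use named shared memory plus some wake-up mechanism, and those wake-ups still cost context switches.

```python
import mmap
import os


def halve_in_child(samples):
    """Scale samples by 0.5 in a forked child via shared memory (Unix only)."""
    n = len(samples)
    # Anonymous mapping; on Unix mmap defaults to MAP_SHARED, so the pages
    # stay shared between parent and child after fork().
    buf = mmap.mmap(-1, n * 4)
    view = memoryview(buf).cast("f")  # treat the bytes as 32-bit floats
    for i, s in enumerate(samples):
        view[i] = s

    pid = os.fork()
    if pid == 0:
        # Child: modify the shared pages in place, then exit immediately.
        status = 1
        try:
            for i in range(n):
                view[i] = view[i] * 0.5
            status = 0
        finally:
            os._exit(status)

    os.waitpid(pid, 0)  # wait until the child has finished writing
    result = list(view)
    view.release()  # the mapping cannot be closed while a view exists
    buf.close()
    return result


if __name__ == "__main__":
    print(halve_in_child([1.0, -0.5, 0.25, 0.0]))
```

The only data crossing the process boundary here is the child's exit status; the samples themselves are never copied between address spaces.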

Context switches on Linux are a pretty heavy affair. This is the result of some choices in the distant past, when the number of context switches per second was much higher than on most other platforms and so it was deemed to be 'good enough'. Unfortunately this rules out a lot of real time or near real time work, especially when the workload is divided over multiple user space processes.
A citation is required for this claim.

I know of no evidence that Linux context switching on x86/x86_64 is slower than on any other OS, and there are some suggestions that it is faster (Linux does not save/restore FP state, which Windows, at least at one point, does).

Linux is as capable or more capable of realtime work than any other general purpose OS, and the latency numbers from actual measurement are excellent (when using PREEMPT_RT etc).

What are you referring to?

Back in the day when the Linux kernel was first written there was a huge argument between Linus and Andrew Tanenbaum about whether the microkernel or the macrokernel road was the superior one.

Tanenbaum argued that a microkernel was lighter, and could switch context faster than a macrokernel (the likes of which UNIX was typically reincarnated with). Linus argued that throughput, not latency is what matters to end users. At that time your typical OS switched tasks 18.5 times per second and Linux did substantially better than that. Case closed, the throughput argument won.

But now, many years later, the consequence is that we are switching contexts orders of magnitude slower than we could, because the context contains a lot more information than it strictly speaking has to. My own QNX clone switched 10K / second on a 486/33, and yes, the IPC mechanism meant that throughput suffered, but for real time applications with a lot of the hard stuff in userspace, context switches are far more important than throughput (and incidentally, also for perceived responsiveness of the OS and apps).

The latency numbers are excellent from the perspective of very forgiving applications. A typical DAW runs with 1K or even larger sample buffers, which is acceptable, but for many real time applications that is an eternity, so those are typically built not on Linux but on some dedicated RTOS.

edit: I had 100K / second before, this was in error. It's been 30 years ;)

If you read the article linked to in the article I linked to above (which is fairly out of date, from 2010):

https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-m...

you will find that on Linux a context switch takes about 30 usec. More recent measurements that take account of the effect of the TLB flush put the range at 10-300usec.

That means that in 2010, on Linux, you could reasonably expect to do at least 30k/sec. In 2021, with realistic audio processing workloads, the range is probably 3-50k/sec.
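
The conversion behind those figures is just the reciprocal of the per-switch cost. A small sketch in Python (the function name is made up for illustration); it gives an upper bound that assumes the CPU does nothing but switch:

```python
def switches_per_second(cost_usec):
    """Upper bound on context switches per second for a given per-switch cost."""
    return 1_000_000 / cost_usec


if __name__ == "__main__":
    for cost in (10, 30, 300):
        rate = switches_per_second(cost)
        print(f"{cost} usec per switch -> at most {rate:,.0f} per second")
```

By the same arithmetic, 10K switches per second on the 486/33 corresponds to 100 usec per switch.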

The 486 has a much lower register count than contemporary processors, which accounts for the faster context switching.

Modern audio processing software on Linux can run with 64 sample buffers, not 1k.
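
To put buffer sizes in time terms, here is a small Python sketch. It assumes a 48 kHz sample rate (the rate mentioned earlier in the thread) and uses the 30 usec per-switch figure from the 2010 article; the function names are hypothetical.

```python
def buffer_period_usec(frames, sample_rate=48_000):
    """Time covered by one buffer of `frames` samples, in microseconds."""
    return frames / sample_rate * 1_000_000


def switches_per_period(frames, switch_cost_usec=30, sample_rate=48_000):
    """How many context switches of the given cost fit in one buffer period."""
    return int(buffer_period_usec(frames, sample_rate) // switch_cost_usec)


if __name__ == "__main__":
    for frames in (64, 1024):
        period = buffer_period_usec(frames)
        fits = switches_per_period(frames)
        print(f"{frames} frames: {period:.0f} usec per cycle, room for {fits} switches")
```

A 64-sample buffer leaves about 1.33 ms per cycle, against about 21.3 ms for 1024 samples. The switch count is only a ceiling, since it leaves no time for the actual audio processing.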

This recent paper on RT Linux on RPi/Beagleboard single board systems concludes that on some of these relatively "low power" systems, 95% of latencies are in the 40-60usec range, which is completely adequate for the majority of RTOS tasks (but not all).

https://www.mdpi.com/2073-431X/10/5/64/pdf

>"The majority of Linux kernels’ measurements with PREEMPT_RT-patched kernel show the minimum response latency to be below 50 μs, both in user and kernel space. The maximum worst-case response latency (wcrl) reached 147 μs for RPi3 and 160 μs for BBB in user space, and 67 μs and 76 μs, respectively, in kernel space (average values). Most of the latencies are quite below this maximum (90% and 95%, respectively, for user space and kernel space). In general, it seems that maximal latencies do not often cross these values."

[ ... ]

"As an outcome, Linux kernels patched with PREEMPT_RT on such devices have the ability to run in a deterministic way as long as a latency value of about 160 μs, as an upper bound, is an acceptable safety margin. Such results reconfirm the reliability of such COTS devices running Linux with real-time support and extend their life cycle for the running applications."

This slide presentation offers up very similar numbers with graphs, also on ARM systems (I think):

https://elinux.org/image/d/de/Real_Time_Linux_Scheduling_Per...

This article shows cyclictest, a very minimal scheduling latency tester, getting the following results on an x86_64 system:

"The average average latency (Avg) is 4.875 us and the average maximum latency (Max) is 20.750 us, with the Max latency on 23 us. So, the average latency raises by 1.875 us, while the average maximum raises by 1.875 us, with the maximum latency raised by 2 us."

https://bristot.me/demystifying-the-real-time-linux-latency/

They conclude:

> "Maximum observed latency values generally range from a few microseconds on single-CPU systems to 250 microseconds on non-uniform memory access systems, which are acceptable values for a vast range of applications with sub-millisecond timing precision requirements. This way, PREEMPT_RT Linux closely fulfills theoretical fully-preemptive system assumptions that consider atomic scheduling operations with negligible overheads."

I'm not sure where you're getting your current info from, but I'm extremely confident that it's wrong. If I had to guess, you have not kept up with the impact of the PREEMPT_RT patchset on the kernel, nor scheduling improvements in general, but I don't know (obviously).
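
For a rough feel of what cyclictest (mentioned above) measures, here is a sketch in Python that sleeps for a fixed interval and records how late each wake-up is. This is an assumption-laden stand-in, not cyclictest: it runs without real-time scheduling priority and includes interpreter overhead, so its numbers will be far worse than the ones quoted above and should not be compared with them. The function names and default parameters are hypothetical.

```python
import time


def wakeup_latencies_usec(interval_usec=1000, loops=200):
    """Sleep repeatedly and return how late each wake-up was, in microseconds."""
    latencies = []
    for _ in range(loops):
        start = time.perf_counter_ns()
        time.sleep(interval_usec / 1_000_000)
        elapsed_usec = (time.perf_counter_ns() - start) / 1000
        # Lateness = time actually slept minus time requested.
        latencies.append(elapsed_usec - interval_usec)
    return latencies


def summarize(latencies):
    """Return (min, average, max) of the measured latencies."""
    return min(latencies), sum(latencies) / len(latencies), max(latencies)


if __name__ == "__main__":
    lo, avg, hi = summarize(wakeup_latencies_usec())
    print(f"min {lo:.1f} usec, avg {avg:.1f} usec, max {hi:.1f} usec")
```

As with cyclictest, the maximum is the number that matters for real time work, not the average.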

The last time I was actively involved in developing real time control of time critical hardware on Linux was about 2007 (a very high speed stepper motor driven plasma cutter: slow down in a curve and you've ruined the workpiece), so for sure I'm out of the loop. But I do have a fairly large Linux audio setup with all of the real time patches installed, and if it is possible to run with 64 sample buffers, I have not been able to do so on my hardware. 1K really is the minimum before I get (inevitably, unfortunately) dropouts under relatively light load.

It might be worth documenting my setup (reproduced across three different machines: a laptop, an 'all-in-one' and a very beefy desktop) to see what could be improved, because that difference is substantial.