Hacker News
by dzdt 3084 days ago
> When using these patches on statically linked applications, especially C++ applications, you should expect to see a much more dramatic performance hit. For microbenchmarks that are switch, indirect-, or virtual-call heavy we have seen overheads ranging from 10% to 50%.

Ouch! This is independent of other performance hits, like the kernel syscall overhead that was the hot topic yesterday. This is pretty crazy.

That's bad. A single 5% hit might not be the end of the world, but 5% here, 10% there, and another 5% over there in the common case add up badly enough. Doubly pathological cases (indirect-call-heavy code making lots of syscalls)... a 50% slowdown and a 30% slowdown compound to roughly a 65% total slowdown (0.5 × 0.7 = 0.35 of the original throughput). Yeowch.
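
The compounding in the comment above can be sketched in a few lines. This is only an illustration, assuming each percentage is an independent fractional loss of throughput that applies multiplicatively; the function names are made up for this example. If the percentages are instead read as increases in run time, the factors multiply the other way (1.5 × 1.3 = 1.95, so almost twice as long).

```python
def combined_throughput_loss(*losses):
    """Combine independent fractional throughput losses multiplicatively.

    Assumption: a loss of 0.30 means the program retains 70% of its
    previous throughput, and losses from different causes are independent.
    """
    remaining = 1.0
    for loss in losses:
        remaining *= 1.0 - loss
    return 1.0 - remaining


def combined_runtime_factor(*increases):
    """Alternative reading: each figure is a fractional increase in run time."""
    factor = 1.0
    for increase in increases:
        factor *= 1.0 + increase
    return factor


if __name__ == "__main__":
    # The "5% here, 10% there, another 5% over there" common case.
    print(combined_throughput_loss(0.05, 0.10, 0.05))
    # The doubly pathological case: indirect-call heavy and syscall heavy.
    print(combined_throughput_loss(0.50, 0.30))
    print(combined_runtime_factor(0.50, 0.30))
```

Either way the point stands: several individually tolerable hits multiply rather than simply add.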

Will be intrigued to see how processor manufacturers respond to this. If they were even slightly relaxed about it prior to disclosure, I expect there are going to be some very hurried attempts to engineer solutions pronto. This is the sort of thing where it might even be worth throwing away all of your future roadmap plans and just getting a revision of the current chips out there ASAP, whatever that may do to the rest of your roadmap.

Sounds like it could be great for processor manufacturers. In the age where CPUs don't get faster, there's finally a reason for customers to buy new CPUs again!
Not really: once a program is compiled with -mretpoline, new hardware won't bring back reliable branch prediction.

I'd hope maybe, just maybe, this would be enough to put a focus on compilers producing code that ends up using processor-optimized paths chosen at runtime, to avoid "overheads ranging from 10% to 50%".

Though, in this case, that would essentially mean making the entire executable region writable for some window of time, which is clearly too dangerous, so I guess the 0.1% speedups from compiling undefined behavior in new and interesting ways will continue taking priority.

I mean, it's a compiler flag, right? Obviously whoever's going to run a program on an unaffected platform will make the effort to recompile everything with the flag removed.

Just the same way every serious application currently provides different executables for running on systems where SSE2, SSE4.1, or AVX2 is present.

Horizontal scaling though. If every individual processor is slower, more are needed.
Not quite - lots of "serious" applications these days are written to target JIT compilers, which would be capable of switching retpoline on and off depending on need.
Funnily enough, I ended up not including a PS starting with "A sufficiently smart JIT, however..." ;)
I'd rather have linkers go down a road similar to the one the Linux kernel took over a decade ago: provide binary patches in a table (essentially alternative machine code) and have the linker patch in the correct alternative depending on the CPU and its bugs. The Linux kernel already contains an "alternatives" segment which is exactly this kind of list of patches. It would be trivial to add such a table to the ELF and PE formats and have the runtime linker process it while it's plowing through the code anyway.
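
As a rough model of the patch-table idea in the comment above, the sketch below applies a list of alternatives to a code image at load time. It is not the Linux kernel's actual data layout or any real linker API: the `Alternative` record, the `apply_alternatives` function, and the flag name `safe_indirect_branches` are all hypothetical, and the machine-code bytes are illustrative only.

```python
from dataclasses import dataclass

NOP = 0x90  # x86 single-byte NOP, used here only as padding


@dataclass(frozen=True)
class Alternative:
    offset: int         # where the patch site starts in the code image
    length: int         # size of the patch site in bytes
    feature: str        # hypothetical CPU feature/bug flag selecting this patch
    replacement: bytes  # bytes to write when the flag is present


def apply_alternatives(code, table, cpu_flags):
    """Return a patched copy of `code`, leaving the default bytes in place
    wherever the running CPU does not report the entry's flag."""
    patched = bytearray(code)
    for alt in table:
        if alt.feature not in cpu_flags:
            continue
        if len(alt.replacement) > alt.length:
            raise ValueError("replacement does not fit in the patch site")
        # Pad with NOPs so the instructions after the site keep their offsets.
        padding = bytes([NOP]) * (alt.length - len(alt.replacement))
        patched[alt.offset:alt.offset + alt.length] = alt.replacement + padding
    return bytes(patched)


if __name__ == "__main__":
    # Illustrative image: a 5-byte jump to a thunk sits at offset 1.
    image = bytes([0x55, 0xE9, 0x10, 0x00, 0x00, 0x00, 0xC3])
    table = [Alternative(offset=1, length=5,
                         feature="safe_indirect_branches",
                         replacement=b"\xff\xe0")]  # a plain indirect jump
    print(apply_alternatives(image, table, {"safe_indirect_branches"}).hex())
    print(apply_alternatives(image, table, set()).hex())
```

The design choice worth noting is that the default bytes are the safe ones, so a loader that knows nothing about the table still produces a correct, if slower, program.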
Something like this exists with function multi-versioning: https://lwn.net/Articles/691932/

For example, glibc chooses optimised machine code for memcpy depending on the CPU it runs on.
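
The resolve-once dispatch behind multi-versioning can be modeled roughly as follows. This is a conceptual sketch in Python, not GCC's or glibc's actual mechanism: the `copy_*` functions are stand-ins with identical behavior, `resolve` and `detect_cpu_flags` are hypothetical helpers, and the flag detection assumes the `flags` line format of `/proc/cpuinfo` on x86 Linux.

```python
def copy_baseline(src):
    return bytes(src)


def copy_sse2(src):
    return bytes(src)  # stand-in for an SSE2-tuned routine


def copy_avx2(src):
    return bytes(src)  # stand-in for an AVX2-tuned routine


# Ordered from most to least specialised; the first supported entry wins.
VERSIONS = [("avx2", copy_avx2), ("sse2", copy_sse2)]


def detect_cpu_flags(path="/proc/cpuinfo"):
    """Best-effort detection; returns an empty set if the file is missing
    or has no 'flags' line (assumption: x86 Linux layout)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.partition(":")[2].split())
    except OSError:
        pass
    return set()


def resolve(cpu_flags, versions, fallback):
    """Pick one implementation up front instead of checking on every call."""
    for flag, impl in versions:
        if flag in cpu_flags:
            return impl
    return fallback


# Resolved once at load time; callers just use copy_bytes afterwards.
copy_bytes = resolve(detect_cpu_flags(), VERSIONS, copy_baseline)

if __name__ == "__main__":
    print(copy_bytes.__name__, copy_bytes(b"hello"))
```

A retpoline-versus-plain-indirect-call choice could in principle be resolved the same way, keyed on whether the CPU reports itself as affected.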

New CPUs could just convert the retpoline back to the original jump in microcode, and enable the now timing-attack safe branch predictor.
But even then a performance hit remains due to the increased code size of the instruction sequence.
There's a lot of space left in code already to insert trampolines later. And at the end of the day most memory is data, not code.

And eventually, this code will get replaced anyway (just like today there are often multiple code paths in binaries, and a lot of code is compiled for host anyway).

In any case, the performance impact of a couple extra bytes per indirect call is small compared to disabling branch target prediction.

Makes me wonder if that's Intel's PR plan (hence their spin on the story). Assuage the general consumer and mainstream media, point the finger at the OS vendors if necessary, and fix the bug in their next-gen chips.

When this all shakes out, the general story is going to be "upgrade sooner, current-gen Intel chips are x% faster", where x is going to be a larger number than it was a week ago.

It's more-or-less the Apple battery story all over again. Current devices are going to be slower, newer ones are going to be faster. Even if you know the why and the how, you're still in the same place as everybody else (at best, you could upgrade just the chip if your MB is new enough, but you're still buying Intel). Unless there's some clear way of imposing the external cost of this bug on Intel, it's a win-win for them.

Here's another potential PR spin: the major cloud vendors got out ahead of this so that they could point the finger at Intel, making customers not ask "Why did you sell me this fundamentally insecure server for so many years?"

Because this isn't really that new nor is it really a bug. Meltdown could be a bug because the asynchronous access of memory in other protection rings is unsafe, but the rest of it is just a normal side-channel attack, an abuse of otherwise-innocent data.

How much moral culpability should rest on the proprietors of software virtualization technologies that don't really safely encapsulate anything due to hardware incongruencies with the modern "sandboxed computing" model?

Surely there have been engineers over the years who've questioned the propriety of misleading users into seeing VMs as fully-encapsulated systems when the hardware just fundamentally doesn't support that type of native encapsulation. Such persons have probably spent the last several years being shunned for being old fogies overly attached to their rust buckets. It'd be interesting to hear some of their stories.

> Even if you know the why and the how, you're still in the same place as everybody else

If you know then you at least have the option to turn off PTI and compile your performance-critical binaries without retpoline.

And maybe disable microcode updates if Intel's updates eat another few percent.

Multicore performance is getting faster at a very high pace.

Ryzen 1700X has 3.3 times the multicore performance of my i5 3450 from 2012 for almost the same price. With 7nm we will see at least another 50% increase in multicore performance on top of that.

Will Linux distributions automatically use this compilation option (or its analog in GCC) for packages from now until forever, even if a faster mitigation is added to CPUs?
I use a binary distro and definitely don't want to be running massively slowed-down software mitigations on a corrected CPU.

Although actually, we already are - binary distros already don't take into account per-microarchitecture scheduling, nor any ISA extensions above a common baseline (e.g. just SSE2, no autovectorising to AVX2 etc).

This might provide enough impetus to restructure how binary distros work and get the whole distro compiled with some newer CPU flags (march={first corrected architecture}?) but in the short term I assume every package will take the hit.

Great time to learn about source-based distros!

Not to worry, it’s “just” 5–10% for “well tuned servers using all of [performance-saving] techniques”.
The sentence that follows the line you quoted is

> However, real-world workloads exhibit substantially lower performance impact.

I feel like you could have mentioned this.

I thought it would be Moore's law that forces people to care about their code's performance. I was wrong, but am nevertheless happy about the recent developments. Programming will become an art once again :)
Agreed. dlopen() should wipe branch prediction caches by default, and we need to add additional flags that turn this off.
It's ok. The 9th generation of Intel will be 50% faster and the most secure CPU ever made! /s