Hacker News
by qwertyoruiop 3956 days ago
add -no_shared_cr3 to your boot-args.

it will have a hefty performance penalty, but if you value security over performance, it'll also protect you against a lot of (even 0day!) exploits.

3 comments

Could you provide some context for that? What does that flag do? How do you even set boot args for OSX? I have little context on the OSX boot process, and would like to understand this better.
'sudo nvram boot-args=-no_shared_cr3' will do the trick.

The flag essentially prevents the kernel from accessing userland memory unless special routines are used. Since the bug is a NULL pointer dereference (which requires a read of userland memory in order to be exploited), exploitation becomes impossible. Due to this flag, however, your kernel will have to context switch every time a system call is made, which does have a noticeable performance impact. I will be releasing a KEXT to fix the bug soon.

Well, I was more looking for an explanation of what "no shared CR3" means. What is CR3, and how would I know to reach for that option as a way to disable this exploit?

And, coming from a Grub/ubuntu perspective, when you say "boot args", I think of the boot loader, which for Grub is configured with config files (text files) or else at boot-time, via the Grub menu. I know OSX has a single-user mode, but don't know of a way to edit boot args prior to completing the boot sequence.

Please don't take this wrong. I'm glad to see the original fix you gave, so much so that I want to know more about it. What provides the capability, how to know the specific options that mitigate such a vulnerability.

CR3: https://en.wikipedia.org/wiki/Control_register#CR3

Some relevant excerpts from Mac OS X and iOS Internals: To the Apple's Core by Jonathan Levin:

From page 133: In 64-bit mode, there is such a huge amount of memory available anyway that it makes sense to follow the model used in other operating systems, namely to map the kernel’s address space into each and every process. This is a departure from the traditional OS X model, which had the kernel in its own address space, but it makes for much faster user/kernel transition (by sharing CR3, the control register containing the page tables).

From page 266: Still, unlike Windows or Linux, OS X applications in 32-bit (Intel) used to enjoy a largely unfettered address space with virtually no kernel reservation — that is, the kernel had its own address space. Apple has conformed, however, and in 64-bit mode OS X behaves more like its monolithic peers: the kernel/user address spaces are shared, unless otherwise stated (by setting the -no-shared-cr3 boot argument on Intel architectures). The same holds true in iOS, wherein XNU currently reserves the top 2 GB of the 4 GB address space (prior to iOS version 4 the separation was 3 GB user/1 GB kernel).

On x86 cr3 is a pointer to the page table. (The page table is a mapping set up by the kernel, it maps virtual to physical addresses or in some cases lets the kernel trap memory accesses.) Once you change it, memory access becomes temporarily slower afaik because the TLB (effectively cpu's cache of page table) is discarded. So changing it more frequently can be a bad thing.
On the second part, Macs boot with UEFI, and the boot process is configured via a small number of variables stored in writable firmware memory [1]. Apple provides a command-line tool, nvram(8) [2], which can either print the current contents of the variables (nvram -p), or request a change to one. Changes are queued and written out at the next clean shutdown or reboot.

[1] A brief description: https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_In...

[2] https://developer.apple.com/library/mac/documentation/Darwin...
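
For anyone who wants to check the setting from a script, here is a rough sketch in Python. It assumes `nvram -p` prints one variable per line as name, tab, value; the function names `parse_nvram_output` and `read_boot_args` are made up for this example, and the second one only works on a Mac.

```python
import subprocess


def parse_nvram_output(text):
    """Parse `nvram -p` style output into a dict.

    Assumes each line is "name<TAB>value"; lines without a tab are kept
    with an empty value.
    """
    variables = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("\t")
        variables[name.strip()] = value.strip()
    return variables


def read_boot_args():
    """Return the current boot-args string, or None if unset (macOS only)."""
    result = subprocess.run(
        ["nvram", "-p"], capture_output=True, text=True, check=True
    )
    return parse_nvram_output(result.stdout).get("boot-args")
```

After running the `sudo nvram boot-args=...` command from upthread, `read_boot_args()` should return the string that was set.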

Intriguing. Thanks for sharing.

Doesn't Linux perform this "context switch at every syscall" ? How does it get away with the performance penalty?

No, Linux x86-64 doesn't change %cr3 on syscalls. It mitigates this kind of bug (kernel NULL pointer dereference) in a different way - by not allowing userspace processes to map memory at NULL.
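
If you want to see that limit on your own Linux box, the knob is the `vm.mmap_min_addr` sysctl, exposed at `/proc/sys/vm/mmap_min_addr`. A small sketch, with helper names that are just for illustration:

```python
def parse_min_addr(text):
    """Convert the sysctl file contents (a decimal string) to an int."""
    return int(text.strip())


def null_page_mappable(min_addr):
    """True if an unprivileged process may map address 0.

    Only a limit of 0 allows mappings at NULL; any positive value
    reserves the low pages.
    """
    return min_addr == 0


def read_mmap_min_addr(path="/proc/sys/vm/mmap_min_addr"):
    """Read the current limit from procfs (Linux only)."""
    with open(path) as f:
        return parse_min_addr(f.read())
```

On a typical distro `read_mmap_min_addr()` gives a nonzero value such as 65536, so `null_page_mappable()` is False. Sufficiently privileged processes can still map below the limit, so this describes the unprivileged case only.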

Linux also supports the SMAP feature on modern Intel CPUs which allows the kernel to set things up so that all accesses to usermode memory from kernel mode must be explicitly annotated.

All operating systems with separate user and kernel modes have a privilege-level round-trip on every syscall (typically `sysenter`/`sysexit`, on older systems the classic `int $0x80`/`iret`). This is just a controlled jump that changes the privilege level, and is what is bypassed by vsyscall.

Non-shared-cr3 Macs (and IIRC some versions of PaX) also change `%cr3`, which means user-space and kernel-space have completely different address spaces (rather than a shared kernel-space and per-process user-space). This is much more expensive.

On Linux, if you have a look at /proc/<pid>/maps, you'll see a 'vsyscall' section mapped into every program. That section has code stubs for each syscall. The stubs for some simple syscalls like gettimeofday() (not sure there are any others) just return the current time, which is stored somewhere in that area. For other syscalls, the stubs use the best method to enter the kernel (sysenter vs. int 80) available on your specific processor.
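
A quick way to look for that section is to pull the bracketed pseudo-mappings out of the maps file. This is a sketch that assumes the usual six-column layout (address range, perms, offset, dev, inode, name); `special_mappings` is a made-up helper name. Depending on the kernel you may see `[vdso]` as well as, or instead of, `[vsyscall]`.

```python
def special_mappings(maps_text):
    """Return {name: (start, end)} for bracketed entries such as [vsyscall]."""
    found = {}
    for line in maps_text.splitlines():
        # The name column is optional, so anonymous mappings have 5 fields.
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue
        name = fields[5].strip()
        if name.startswith("[") and name.endswith("]"):
            start, end = (int(part, 16) for part in fields[0].split("-"))
            found[name] = (start, end)
    return found
```

Feed it the contents of `/proc/self/maps` (or any `/proc/<pid>/maps` you can read) to list the regions for that process.
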
There were only ever "vsyscall" entries for three syscalls: time, gettimeofday, and getcpu.

On recent kernels, the vsyscalls are actually the slowest way of all to ask for the time or the cpu number. They're only supported at all as a fallback, and the fallback is very slow, because it tries to mitigate exploit risks due to having code at a fixed address.

https://lwn.net/Articles/446528/

OSX should be doing the context switch also.

The added penalty is a switch to usermode to read userland data, then a switch back to kernel to continue on... it's just additional context switches for reading userland memory.

CR3 is the x86 register that points to the root page table. When an OS switches between processes, it generally changes CR3. On the other hand, when an OS switches from user mode to kernel mode, it usually leaves CR3 alone.
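
To make "root page table" a bit more concrete: with standard x86-64 4-level paging and 4 KiB pages, the hardware starts at the table CR3 points to and uses four 9-bit slices of the virtual address as indices on the way down, plus a 12-bit offset into the page. A toy sketch of that split (the function and key names are mine, and this ignores large pages and 5-level paging):

```python
def split_vaddr(vaddr):
    """Split a 48-bit x86-64 virtual address into 4-level paging indices."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,   # index into the table CR3 points to
        "pdpt": (vaddr >> 30) & 0x1FF,   # page-directory-pointer table index
        "pd": (vaddr >> 21) & 0x1FF,     # page directory index
        "pt": (vaddr >> 12) & 0x1FF,     # page table index
        "offset": vaddr & 0xFFF,         # byte offset within the 4 KiB page
    }
```

Swapping CR3 swaps the whole tree these indices walk through, which is why cached translations in the TLB stop being valid afterwards.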

I know essentially nothing about Darwin, but "no shared CR3" presumably means that the kernel will switch CR3 to make user memory inaccessible when running in kernel mode. This is approximately what grsecurity's UDEREF feature does.

On Linux, on Broadwell or newer, there's a similar HW mitigation called SMAP. Darwin might use it, too.

Linux also doesn't allow unprivileged programs to map very low addresses, making NULL pointer dereferences much harder to exploit.

I did `sudo nvram boot-args="-no_shared_cr3"`, and I think I see the (seemingly minor, at this point) performance hit. There's an explanatory comment in http://opensource.apple.com/source/xnu/xnu-2782.1.97/osfmk/x... that seems to explain the feature, though not in detail.
What do you mean by "minor performance hit"? Could you provide some data?
Unfortunately this 'fix' seems to break VirtualBox.
Thanks for the tip, applied.