by comex 2495 days ago
The other replies to your post talk about various older or embedded systems, but here's the answer for the typical systems actually affected by that CVE, running 32-bit or 64-bit x86:

If nothing is mapped at 0, the kernel will fault just like userland would. This results in a kernel panic.

However, the kernel and userland share an address space. On a 32-bit x86 system with the default configuration, Linux allocates addresses 0 to 0xc0000000 to userland, and 0xc0000000 to 0xffffffff to the kernel. [1] (Each userland process had its own page table, but the kernel mapped itself into every page table.) This is unavoidable to some extent, because an interrupt or system call switches the system to kernel mode and jumps to a kernel-provided address, but does not automatically swap the page table, so at least the interrupt handler needs to be mapped into every page table. [2] x86-64 is similar, but with the upper half of the address space reserved for the kernel.
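To make the default 32-bit split concrete, here is a minimal sketch that classifies an address as userland or kernel. The boundary value comes from the paragraph above; the macro and function names are made up for illustration and are not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Boundary of the default 3G/1G split on 32-bit x86 Linux.
 * The name is hypothetical, chosen for this sketch. */
#define KERNEL_BASE_32 0xc0000000u

/* True if a 32-bit virtual address falls in the userland range
 * (0 up to, but not including, 0xc0000000). */
static bool is_user_address_32(uint32_t addr)
{
    return addr < KERNEL_BASE_32;
}
```

Note that address 0 lands in the userland range, which is why a kernel null pointer dereference can end up touching user-controlled memory.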

So why can't user code mess with the kernel's data? Each entry in the page table has a single privilege level bit. If it's set, both user and kernel code can access the page; if it's clear, only kernel code can access it. [3] At the time, there was no way to make memory accessible from user code but not the kernel, as that was considered unnecessary. Thus, userland couldn't access kernel pointers, but the kernel could directly load/store pointers belonging to the current user process. This was used intentionally when the kernel needed to copy data in or out of the process, but it also meant that if the kernel code accidentally dereferenced a bad pointer, it could end up referring to userland data.
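The access rule described above can be modeled in a few lines. This is a simplified sketch, not how the MMU is actually implemented: it only looks at the present bit and the user/supervisor bit of a page table entry and ignores write protection, NX and everything else. The bit positions match the x86 page table entry layout; the function name is an assumption for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (1u << 0)  /* page is mapped */
#define PTE_USER    (1u << 2)  /* U/S bit: set means user-accessible */

/* Pre-SMAP rule: kernel code may touch any present page,
 * user code only pages with the U/S bit set. */
static bool access_allowed(uint32_t pte, bool from_user)
{
    if (!(pte & PTE_PRESENT))
        return false;               /* not mapped: page fault */
    if (from_user)
        return (pte & PTE_USER) != 0;
    return true;                    /* kernel: no restriction */
}
```

The last line is the important one: nothing stops the kernel from reading or writing a user page, whether it meant to or not.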

That included the null pointer: accesses to it would succeed if and only if the current user program had previously mapped something at address 0, via mmap() with the MAP_FIXED flag. And that's what the exploit code did.

The page tables are under the kernel's control, so the kernel could make null pointer dereferences unexploitable (resulting in a kernel panic but nothing more) simply by refusing to allow user processes to map memory at address 0 – and in fact Linux already had a setting to do so (mmap_min_addr). But it was a setting rather than simply hardcoded into the kernel, because... well, some real software actually depends on mapping address 0 for silly reasons, mostly pseudo-emulation software like dosemu and wine which directly runs the emulated code in its address space. So not all systems had the setting enabled, and there was also a separate issue where enabling SELinux would cause mmap_min_addr to be ignored. [4]
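On Linux the current value of this setting is exposed as the sysctl vm.mmap_min_addr. A small sketch for reading it, assuming procfs is mounted in the usual place (the function name is made up for this example):

```c
#include <stdio.h>

/* Returns the current vm.mmap_min_addr value, or -1 if it cannot be
 * read (for example on a non-Linux system or without /proc). */
static long read_mmap_min_addr(void)
{
    FILE *f = fopen("/proc/sys/vm/mmap_min_addr", "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A value of 0 means userland is allowed to request a mapping at address 0; any larger value blocks fixed mappings below that address for unprivileged processes.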

Years later, Intel added an extension in newer CPUs called SMAP (Supervisor Mode Access Prevention), which is simply a flag that makes the kernel fault if it tries to access pages marked as accessible to userland. In other words, the privilege bit now selects between kernel only and user only. Much saner – after all, even with mmap_min_addr blocking exploitation of null pointers, other garbage pointers could still end up pointing to userland, which made it easier to exploit the kernel (though, compared to the situation with null pointers, it's more often a question of "how easy is it to write an exploit" or "how reliable is the exploit" than of exploitable versus unexploitable). The kernel and userland still share an address space, though, so the kernel can just toggle off the flag when it's intentionally accessing userland data.
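Here is a simplified model of how SMAP changes the kernel side of the access check. It is a sketch under assumptions, not a description of the hardware logic: it only considers the present and user/supervisor bits, and it models the "toggle off" as a single override flag (on real hardware this is, as far as I know, the AC bit in EFLAGS, set and cleared with the STAC and CLAC instructions). The function and parameter names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (1u << 0)  /* page is mapped */
#define PTE_USER    (1u << 2)  /* U/S bit: set means user-accessible */

/* Kernel-mode access check with SMAP taken into account.
 * smap_enabled: the CPU feature is turned on.
 * override:     the kernel has temporarily lifted the protection
 *               because it intends to touch userland memory. */
static bool kernel_access_allowed(uint32_t pte, bool smap_enabled,
                                  bool override)
{
    if (!(pte & PTE_PRESENT))
        return false;                       /* not mapped: fault */
    if ((pte & PTE_USER) && smap_enabled && !override)
        return false;                       /* SMAP violation: fault */
    return true;
}
```

With SMAP on and no override, a stray kernel pointer that lands in userland (including a null pointer with page 0 mapped) faults instead of silently reading attacker-controlled data.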

(Even later, the Meltdown hardware vulnerability triggered the implementation of kernel page-table isolation, but that's another story.)

[1] https://lwn.net/Articles/75174/

[2] https://stackoverflow.com/questions/32598810/does-cr3-change...

[3] https://webcache.googleusercontent.com/search?q=cache:b5g4ss...

[4] https://blog.namei.org/2009/07/18/a-brief-note-on-the-2630-k...

1 comment

Thanks for the very detailed explanation!

If I understand correctly, a page mapped by a process at address zero lets both userland and kernel code trigger unexpected code paths, since page access isn't exclusively kernel or userland. The optimizations mentioned in TFA add even more potential for issues, since userland code could control pointers in that zero page to point to arbitrary data in userland that the kernel can read.

This is fascinating. I didn't know it was possible to share pages between userland and the kernel, and always assumed those two were strictly segregated.

Yep. Something I didn't mention is that if you just try to allocate memory without using MAP_FIXED to force a particular address, the kernel will never choose address 0, regardless of the value of mmap_min_addr. That's true even if the entire rest of the address space is filled. Therefore, userland programs can rely on accesses to address 0 causing a fault unless they specifically ask to map it, which makes the compiler optimization in question perfectly reasonable for most of them. After all, a userland program doesn't worry about being exploited by itself.

(There's still potential for unexpected behavior in those userland programs that do map 0, like wine and dosemu. Even if those programs themselves are compiled with -fno-delete-null-pointer-checks – I'm not sure whether they are – they link to system libraries which aren't. Oh well.)
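For anyone who hasn't seen the pattern, the kind of code affected by that optimization looks roughly like this. The struct and function are hypothetical, written only to show the shape of the problem; they are not taken from the kernel or from TFA.

```c
#include <stddef.h>

/* Hypothetical type, for illustration only. */
struct device {
    int flags;
};

static int read_flags(struct device *dev)
{
    int flags = dev->flags;   /* dereference happens first */

    /* Because dev was already dereferenced, the compiler may assume
     * it is non-null and delete this check, unless the code is built
     * with -fno-delete-null-pointer-checks. */
    if (dev == NULL)
        return -1;

    return flags;
}
```

In a normal userland program the dereference on the first line would already have crashed, so dropping the check changes nothing. If something is mapped at address 0, the dereference succeeds and the check that was supposed to catch the null pointer may no longer exist.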