by nnx 3956 days ago
Intriguing. Thanks for sharing.

Doesn't Linux perform this "context switch at every syscall"? How does it get away with the performance penalty?

4 comments

No, Linux x86-64 doesn't change %cr3 on syscalls. It mitigates this kind of bug (kernel NULL pointer dereference) in a different way - by not allowing userspace processes to map memory at NULL.
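
For anyone who wants to check this on their own box, here is a minimal sketch. It assumes the restriction is exposed through the `vm.mmap_min_addr` sysctl under procfs, which the comment above does not name. The helper names `read_mmap_min_addr` and `null_page_mappable` are made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Assumed procfs location of the vm.mmap_min_addr sysctl on Linux.
MMAP_MIN_ADDR_PATH = Path("/proc/sys/vm/mmap_min_addr")


def read_mmap_min_addr(path: Path = MMAP_MIN_ADDR_PATH) -> Optional[int]:
    """Return the lowest address userspace may map, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Not Linux, procfs not mounted, or unexpected file contents.
        return None


def null_page_mappable(min_addr: Optional[int]) -> Optional[bool]:
    """True if the limit is 0, i.e. a mapping at address 0 is not blocked."""
    if min_addr is None:
        return None
    return min_addr == 0


if __name__ == "__main__":
    limit = read_mmap_min_addr()
    print("vm.mmap_min_addr:", limit)
    print("NULL page mappable:", null_page_mappable(limit))
```

A non-zero value means an ordinary process cannot place a mapping at address 0, so a kernel NULL dereference faults instead of landing on attacker-controlled memory.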

Linux also supports the SMAP feature on modern Intel CPUs which allows the kernel to set things up so that all accesses to usermode memory from kernel mode must be explicitly annotated.

All operating systems with separate user and kernel modes have a privilege-level round-trip on every syscall (typically `sysenter`/`sysexit`, or on older systems the classic `int $0x80`/`iret`). This is just a controlled jump that changes the privilege level, and it is what vsyscall bypasses.

Non-shared-cr3 Macs (and IIRC some versions of PaX) also change `%cr3`, which means user-space and kernel-space have completely different address spaces (rather than a shared kernel-space and per-process user-space). This is much more expensive.

On Linux, if you have a look at /proc/<pid>/maps, you'll see a 'vsyscall' section mapped into every program. That section has code stubs for each syscall. For some simple syscalls like gettimeofday() (not sure there are any others), the stubs just return the current time, which is stored somewhere in that area. For other syscalls, the stubs use the best method to enter the kernel (sysenter vs. int 80) available on your specific processor.

There were only ever "vsyscall" entries for three syscalls: time, gettimeofday, and getcpu.

On recent kernels, the vsyscalls are actually the slowest way of all to ask for the time or the CPU number. They're only supported at all as a fallback, and the fallback is very slow because it tries to mitigate the exploit risks of having code at a fixed address.

https://lwn.net/Articles/446528/
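
To see which of these kernel-provided mappings a process actually has, something like the following should work. It is a rough sketch that assumes the usual /proc/<pid>/maps line layout (address range, perms, offset, dev, inode, name). The `[vdso]` and `[vvar]` names are not mentioned in the comments above and are included as an assumption about what a current kernel shows. `SAMPLE` is illustrative text, not captured output, and `special_regions` is a made-up helper name.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    perms: str
    name: str


def special_regions(maps_text: str) -> List[Region]:
    """Pick out the [vdso], [vvar] and [vsyscall] lines from maps text."""
    wanted = {"[vdso]", "[vvar]", "[vsyscall]"}
    found = []
    for line in maps_text.splitlines():
        fields = line.split()
        # Assumed layout: address perms offset dev inode [name]
        if len(fields) < 6 or fields[5] not in wanted:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        found.append(Region(start, end, fields[1], fields[5]))
    return found


# Illustrative sample only; real addresses and permissions vary.
SAMPLE = """\
7ffd3a5f2000-7ffd3a5f4000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
"""

if __name__ == "__main__":
    try:
        with open("/proc/self/maps") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # not on Linux: fall back to the illustrative sample
    for r in special_regions(text):
        print(f"{r.name:<11} {r.start:#x}-{r.end:#x} {r.perms}")
```

On x86-64 the vsyscall page, when present, sits at the same address in every process, which is the fixed-address problem described above.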

OSX should be doing the context switch also.

The added penalty is a switch to usermode to read userland data, then a switch back to the kernel to continue. It's just additional context switches for reading userland memory.