by poizan42 3422 days ago
> The fact that that requires implementing Linux kernel functionality is simply an artifact of the lack of a more modular design (or really any up front design) in Linux.

Curiously the recommended syscall mechanism on x86 is by calling __kernel_vsyscall in the vDSO. If everybody did that, you could just write your own loader with a custom vDSO that implements the syscalls in userspace. However, some programs (especially statically built ones) still make syscalls directly with int 80h, which is slow to trap in userspace and may not even be possible, depending on the OS.
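
For context, a process finds the vDSO through the auxiliary vector that its loader sets up, which is also where a custom loader would advertise its own image. A minimal C sketch, assuming Linux and a libc that provides getauxval; the helper names vdso_base and looks_like_elf are made up for illustration:

```c
#include <assert.h>
#include <elf.h>        /* ELFMAG, SELFMAG, AT_SYSINFO_EHDR */
#include <string.h>     /* memcmp */
#include <sys/auxv.h>   /* getauxval */

/* Base address of the vDSO image advertised in the auxiliary vector,
 * or NULL if the loader did not advertise one. */
static const void *vdso_base(void)
{
    return (const void *)getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if `base` points at something that starts with the ELF magic. */
static int looks_like_elf(const void *base)
{
    return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

Calling looks_like_elf(vdso_base()) should return 1 on a normal Linux process. As far as I know, 32-bit x86 also gets an AT_SYSINFO entry holding the address of __kernel_vsyscall itself, so a custom loader would have to fill in both.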

Now it would have been fantastic if calling through the vDSO had simply been the only documented way of doing a syscall on x86_64, but the kernel developers at the time decided against that. So on x86_64 the syscall instruction is always used directly, and we can't even trap that on any x86_64 OS: most of the time, either some unrelated syscall gets executed or the kernel returns an error to the calling process without any trap.


vDSO maintainer here.

> Curiously the recommended syscall mechanism on x86 is by calling __kernel_vsyscall in the vDSO.

There are times when this doesn't work. Syscall resumption and cancellation come to mind. Also, __kernel_vsyscall is a hack to make fast syscalls work on the awful 32-bit x86 architecture, not a nice feature.

> it would have been fantastic if calling through the vDSO had simply been the only documented way of doing a syscall on x86_64

There is no __kernel_vsyscall or similar feature on x86_64.

> we can't even trap that on any x86_64 OS

You can on Linux using seccomp.
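
A rough C sketch of what that looks like, assuming a kernel with seccomp filter support: a BPF filter returns SECCOMP_RET_TRAP for one syscall number, and the kernel then delivers SIGSYS instead of running the syscall. The names trap_syscall, on_sigsys and trapped_nr are invented for this example, and a real filter should also check seccomp_data.arch, which is skipped here for brevity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/filter.h>    /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>   /* struct seccomp_data, SECCOMP_RET_* */
#include <signal.h>
#include <stddef.h>          /* offsetof */
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Number of the last syscall the filter trapped, or -1 if none yet. */
static volatile sig_atomic_t trapped_nr = -1;

/* SIGSYS handler: the kernel reports the attempted syscall in si_syscall.
 * An emulation layer would compute a result here and store it in the
 * saved return-value register of the ucontext. */
static void on_sigsys(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    trapped_nr = si->si_syscall;
}

/* Install a filter that traps syscall number `nr` and allows the rest.
 * Returns 0 on success, -1 on failure. */
static int trap_syscall(int nr)
{
    struct sock_filter prog[] = {
        /* Load the syscall number from struct seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it matches, fall through to TRAP; otherwise skip to ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = sizeof prog / sizeof prog[0],
        .filter = prog,
    };
    struct sigaction sa = { .sa_sigaction = on_sigsys, .sa_flags = SA_SIGINFO };

    if (sigaction(SIGSYS, &sa, NULL) != 0)
        return -1;
    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```

After trap_syscall(SYS_getppid) succeeds, a later syscall(SYS_getppid) lands in on_sigsys rather than in the kernel's getppid. The value it returns to the caller is architecture-dependent unless the handler sets it.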

> There are times when this doesn't work. Syscall resumption and cancellation come to mind. Also, __kernel_vsyscall is a hack to make fast syscalls work on the awful 32-bit x86 architecture, not a nice feature.

It's a pretty nice feature in the context of being able to make compatibility layers on other OSes in userspace, which was the discussion here. Or would be if it was always used. Why doesn't it work for syscall resumption and cancellation?

> There is no __kernel_vsyscall or similar feature on x86_64.

Yes that was exactly what I was complaining about.

> You can on Linux using seccomp.

Yes, but why would I want to make a Linux compatibility layer on Linux?

> It's a pretty nice feature in the context of being able to make compatibility layers on other OSes in userspace, which was the discussion here. Or would be if it was always used. Why doesn't it work for syscall resumption and cancellation?

For resumption, a signal that interrupts a resumable syscall points RIP to an explicit int 80 instruction in the vDSO. This behavior would be a bit unfriendly to emulate.

For cancellation, the only good implementation of cancellation that I'm aware of (musl's) relies on syscalls being an actual atomic instruction so that a signal handler can tell whether a syscall actually happened. __kernel_vsyscall is an opaque function and can't be used like this.
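
To make the cancellation point concrete, here is a simplified C sketch of the kind of check such a signal handler does. It is an illustration under my own assumptions, not musl's actual code: classify_pc, cp_begin and cp_end are hypothetical names, with cp_end standing for the address immediately after the single syscall instruction.

```c
#include <assert.h>
#include <stdint.h>

enum cancel_action {
    ACT_ON_CANCEL,     /* syscall has not completed: safe to cancel now */
    LET_RESULT_STAND   /* syscall already returned (or PC is elsewhere) */
};

/* Decide what a cancellation signal handler should do, given the
 * interrupted program counter and the bounds of the cancellable region.
 * cp_begin: start of the cancellation-point code (hypothetical label).
 * cp_end:   address right after the syscall instruction. */
static enum cancel_action classify_pc(uintptr_t pc,
                                      uintptr_t cp_begin,
                                      uintptr_t cp_end)
{
    assert(cp_begin < cp_end);
    if (pc >= cp_begin && pc < cp_end)
        return ACT_ON_CANCEL;
    return LET_RESULT_STAND;
}
```

The check only works because the boundary sits directly after one known instruction. With an opaque __kernel_vsyscall, the interrupted PC is somewhere inside the vDSO and the handler has no reliable boundary to compare against.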

> Yes, but why would I want to make a Linux compatibility layer on Linux?

For sandboxing? For experimentation? Or how about to make a compatibility layer emulating something else that runs on Linux?