Hacker News new | ask | show | jobs
by JoshTriplett 770 days ago
> The system ABI of Linux really isn't the syscall interface, its the system libc.

You might have reasons to prefer to use libc; some software has reason to not use libc. Those preferences are in conflict, but one of them is not automatically right and the other wrong in all circumstances.

Many UNIX systems did follow the premise that you must use libc and the syscall interface is unstable. Linux pointedly did not, and decided to have a stable syscall ABI instead. This means it's possible to have multiple C libraries, as well as other libraries, which have different needs or goals and interface with the system differently. That's a useful property of Linux.

There are a couple of established mechanism on Linux for intercepting syscalls: ptrace, and BPF. If you want to intercept all uses of a syscall, intercept the syscall. If you want to intercept a particular glibc function in programs using glibc, or for that matter a musl function in a program using musl, go ahead and use LD_PRELOAD. But the Linux syscall interface is a valid and stable interface to the system, and that's why LD_PRELOAD is not a complete solution.

2 comments

It's true that Linux has a stable-ish syscall table. What is funny is that this caused a whole series of Samsung Android phones to reboot randomly with some apps because Samsung added a syscall at the same position someone else did in upstream linux and folks staticly linking their own libc to avoid boionc libc were rebooting phones when calling certain functions because the Samsung syscall causing kernel panics when called wrong. Goes back to it being a bad idea to subvert your system libc. Now, distro vendors do give out multiple versions of a libc that all work with your kernel. This generally works. When we had to fix ABI issues this happened a few times. But I wouldn't trust building our libc and assuming that libc is portable to any linux machine to copy it to.
> It's true that Linux has a stable-ish syscall table.

It's not "stable-ish", it's fully stable. Once a syscall is added to the syscall table on a released version of the official Linux kernel, it might later be replaced by a "not implemented" stub (which always returns -ENOSYS), but it will never be reused for anything else. There's even reserved space on some architectures for the STREAMS syscalls, which were AFAIK never on any released version of the Linux kernel.

The exception is when creating a new architecture; for instance, the syscall table for 32-bit x86 and 64-bit x86 has a completely different order.

I think what they meant (judging by the example you ignored) is that the table changes (even if append-only) and you don't know which version you actually have when you statically compile your own version. Thus, your syscalls might be using a newer version of the table but it a) not actually be implemented, or b) implemented with something bespoke.
> Thus, your syscalls might be using a newer version of the table but it a) not actually be implemented,

That's the same case as when a syscall is later removed: it returns -ENOSYS. The correct way is to do the call normally as if it were implemented, and if it returns -ENOSYS, you know that this syscall does not exist in the currently running kernel, and you should try something else. That is the same no matter whether it's compiled statically or dynamically; even a dynamic glibc has fallback paths for some missing syscalls (glibc has a minimum required kernel version, so it does not need to have fallback paths for features introduced a long time ago).

> or b) implemented with something bespoke.

There's nothing you can do to protect against a modified kernel which does something different from the upstream Linux kernel. Even going through libc doesn't help, since whoever modified the Linux kernel to do something unexpected could also have modified the C library to do something unexpected, or libc could trip over the unexpected kernel changes.

One example of this happening is with seccomp filters. They can be used to make a syscall fail with an unexpected error code, and this can confuse the C library. More specifically, a seccomp filter which forces the clone3 syscall to always return -EPERM breaks newer libc versions which try the clone3 syscall first, and then fallback to the older clone syscall if clone3 returned -ENOSYS (which indicates an older kernel that does not have the clone3 syscall); this breaks for instance running newer Linux distributions within older Docker versions.

Every kernel I’ve ever used has been different from an upstream kernel, with custom patches applied. It’s literally open source, anyone can do anything to it that they want. If you are using libc, you’d have a reasonable expectation not to need to know the details of those changes. If you call the kernel directly via syscall, then yeah, there is nothing you can do about someone making modifications to open source software.
Generally, if you're patching the kernel you should coordinate with upstream if you care about your ABI.
I don’t think anyone is asking Linus for permission to backport security fixes to their own kernels… that’s ridiculous.
If your security fix alters the ABI it is almost always the case that this is done
The complication with the linux syscall interface is that it turns the worse is better up to 11. Like setuid works on a per thread basis, which is seriously not what you want, so every program/runtime must do this fun little thread stop and start and thunk dance.
Yeah, agreed. One of the items on my long TODO list is adding `setuid_process` and `setgid_process` and similar, so that perhaps a decade later when new runtimes can count on the presence of those syscalls, they can stop duplicating that mechanism in userspace.