| > Without SELinux or seccomp restrictions, root can still pretty much do anything to the host even inside the containers.

That's not true. From user_namespaces(7):

> Having a capability inside a user namespace permits a process to
perform operations (that require privilege) only on resources
governed by that namespace. In other words, having a capability
in a user namespace permits a process to perform privileged
operations on resources that are governed by (nonuser) namespaces
owned by (associated with) the user namespace (see the next
subsection).
On the other hand, there are many privileged operations that
affect resources that are not associated with any namespace type,
for example, changing the system (i.e., calendar) time (governed
by CAP_SYS_TIME), loading a kernel module (governed by
CAP_SYS_MODULE), and creating a device (governed by CAP_MKNOD).
Only a process with privileges in the initial user namespace can
perform such operations.
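
To make the quoted passage a bit more concrete, here is a rough sketch of mine (not from the man page) that decodes the `CapEff` mask from `/proc/self/status`. The helper names `decode_caps` and `effective_caps` are made up for illustration; the bit numbers are the ones in `<linux/capability.h>`. Run as "root" inside an unprivileged user namespace, I'd expect it to list CAP_SYS_TIME and friends as present, and yet setting the clock should still fail, because the mask says nothing about which user namespace those capabilities are held in.

```python
# Bit numbers from <linux/capability.h>; only the capabilities discussed here.
CAP_BITS = {
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_TIME": 25,
    "CAP_MKNOD": 27,
}


def decode_caps(mask):
    """Return the names (from CAP_BITS) whose bit is set in the mask."""
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


def effective_caps(status_path="/proc/self/status"):
    """Read the CapEff hex mask from a /proc/<pid>/status file.

    Returns None if the file or the field is unavailable (e.g. not Linux).
    """
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("CapEff:"):
                    return int(line.split()[1], 16)
    except OSError:
        return None
    return None


if __name__ == "__main__":
    mask = effective_caps()
    if mask is None:
        print("CapEff not available on this system")
    else:
        # The mask is relative to the process's own user namespace.
        print(f"CapEff={mask:#018x}", sorted(decode_caps(mask)))
```

The point of the exercise: the capability bits look identical inside and outside the container, so what matters is the namespace the kernel checks them against.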

> For the most part this is troublesome when parts of the kernel or host userspace code are not fully aware of the different forms of namespacing (there are still portions that just check for an effective UID of 0, without checking whether they're in a namespace for example).

Yes, like I said:

> But an attacker can escape by exploiting the kernel, which I think most security people would consider to be not particularly high effort.

> Dropping privileges to a different user still trace's it origin back to the root user on the host

It does not, unless the process creating the container is root, which with unprivileged user namespaces is not (necessarily) the case.

> The NS_GET_OWNER_UID ioctl(2) operation can be used to discover the user ID of the owner of the namespace; see ioctl_ns(2).

"root" isn't the point anyway; it's about checking capabilities. The problem is that the Linux kernel has historically not cared about root -> kernel privesc, and containers expose more attack surface because of that. But an attacker outside of a container can still just enter a namespace (user namespaces are unprivileged) and perform the exact same privesc, so containers aren't making anything worse.

> As long as the container runtime is performing the various namespace isolation primitives starting from the root user these container bypasses are going to be a risk. There are 'rootless' versions of containers which can only use the privileges available to lower (presumably heavily restricted) user but those aren't widely used.

That's not how namespaces work. Even with 'rootless' containers your guest has CAP_SYS_ADMIN inside its own user namespace. The only difference is that the daemon that starts the container isn't privileged, because user namespaces are increasingly becoming unprivileged. Rootless changes nothing, except that an attack against the daemon itself won't be an insta-privesc to root on the host; it will only be a privesc to the user running the daemon on the host.

Anyway, let's step back. What is a security boundary? I would say it is a mechanism that restricts an attacker such that the attacker must exploit a vulnerability to get around the restriction. By that measure, containers are a boundary. Is exploitation difficult? Not necessarily; like I said, the Linux kernel has loads of attack surface. But it meets a reasonable criterion for a boundary.

As an example, chroot on its own is not a boundary because attackers can just call chroot again. This requires no vulnerability, it will never be patched, and you need another layer to prevent it. Containers have nothing like that: there is no "just let me out" syscall, so you need another vulnerability.

You can read more about user namespaces here: https://www.man7.org/linux/man-pages/man7/user_namespaces.7.... |