> Without SELinux or seccomp restrictions, root can still pretty much do anything to the host even inside the containers.

That's not true. From user_namespaces(7):

       Having a capability inside a user namespace permits a process to
       perform operations (that require privilege) only on resources
       governed by that namespace.  In other words, having a capability
       in a user namespace permits a process to perform privileged
       operations on resources that are governed by (nonuser) namespaces
       owned by (associated with) the user namespace (see the next
       subsection).

       On the other hand, there are many privileged operations that
       affect resources that are not associated with any namespace type,
       for example, changing the system (i.e., calendar) time (governed
       by CAP_SYS_TIME), loading a kernel module (governed by
       CAP_SYS_MODULE), and creating a device (governed by CAP_MKNOD).
       Only a process with privileges in the initial user namespace can
       perform such operations.
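To make that concrete, here is a minimal sketch (not from the man page) that decodes the `CapEff` mask from `/proc/<pid>/status` into the capabilities discussed in this thread. The helper names `held_caps` and `current_capeff` are made up for illustration; the bit numbers are the ones in `linux/capability.h`. Note that the mask only says what the process holds in its own user namespace. For the non-namespaced operations listed above, the kernel checks against the initial user namespace, so a full mask inside a container does not mean the process can set the clock or load modules.

```python
# Bit positions from linux/capability.h (only the ones named in this thread).
CAP_BITS = {
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_TIME": 25,
    "CAP_MKNOD": 27,
}


def held_caps(capeff_hex):
    """Return the names from CAP_BITS that are set in a CapEff hex mask."""
    mask = int(capeff_hex, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


def current_capeff(status_path="/proc/self/status"):
    """Read the CapEff field for this process, or None if unavailable."""
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except OSError:
        pass
    return None


if __name__ == "__main__":
    capeff = current_capeff()
    if capeff is None:
        print("no CapEff field found (not Linux, or /proc not mounted)")
    else:
        # These are capabilities relative to the process's *own* user namespace.
        print(capeff, sorted(held_caps(capeff)))
```

Running it on the host as a normal user and then as "root" inside a user namespace should show very different masks, even though neither process can perform the non-namespaced operations.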

> For the most part this is troublesome when parts of the kernel or host userspace code are not fully aware of the different forms of namespacing (there are still portions that just check for an effective UID of 0, without checking whether they're in a namespace for example).

Yes, like I said:

> But an attacker can escape by exploiting the kernel, which I think most security people would consider to be not particularly high effort.

> Dropping privileges to a different user still traces its origin back to the root user on the host

It does not. That is only true if the process creating the container is root, which with unprivileged user namespaces is not (necessarily) the case.

> The NS_GET_OWNER_UID ioctl(2) operation can be used to discover the user ID of the owner of the namespace; see ioctl_ns(2).
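If you want to check this yourself, the ioctl quoted above can be called from Python. This is a rough sketch, assuming Linux 4.11 or later and an architecture where `_IOC_NONE` is 0 (x86, arm), so that `_IO(0xb7, 0x4)` from `linux/nsfs.h` works out to `0xb704`. The function name `userns_owner_uid` is mine, and it returns `None` rather than raising when the ioctl is not available.

```python
import array
import fcntl
import os

NSIO = 0xB7  # ioctl "type" byte used by linux/nsfs.h


def _IO(type_, nr):
    # Assumption: _IOC_NONE == 0 and no size field, as on x86 and arm.
    return (type_ << 8) | nr


NS_GET_OWNER_UID = _IO(NSIO, 0x4)


def userns_owner_uid(path="/proc/self/ns/user"):
    """Return the owner UID of the user namespace at `path`, or None."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return None
    try:
        buf = array.array("I", [0])  # the kernel writes a uid_t here
        fcntl.ioctl(fd, NS_GET_OWNER_UID, buf, True)
        return buf[0]
    except OSError:
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    print("owner uid of my user namespace:", userns_owner_uid())
```

Pointing `path` at `/proc/<pid>/ns/user` for a process in a rootless container should report the unprivileged user who created the namespace, not root.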

"root" isn't the point anyway; what matters is which capabilities get checked, and against which namespace. The problem is that the Linux kernel has historically not cared about root -> kernel privesc, and containers expose more attack surface because of that. But an attacker outside of a container can still just enter a namespace (user namespaces are unprivileged) and perform the exact same privesc, so containers aren't making anything worse.

> As long as the container runtime is performing the various namespace isolation primitives starting from the root user these container bypasses are going to be a risk. There are 'rootless' versions of containers which can only use the privileges available to lower (presumably heavily restricted) user but those aren't widely used.

That's not how namespaces work. Even with 'rootless' containers your guest has CAP_SYS_ADMIN inside its user namespace. The only difference is that the daemon that starts the container isn't privileged, which is possible because creating user namespaces increasingly doesn't require privilege. Rootless changes nothing, except that attacks against the daemon itself won't be an insta-privesc to root on the host; they'll only be a privesc to the user running the daemon on the host.
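Here's a small sketch of that point, assuming a Linux host where unprivileged user namespaces are enabled. It forks a child, calls `unshare(CLONE_NEWUSER)` through ctypes, and reads back the child's `CapEff`. The function name `capeff_in_new_userns` is made up for this example, and it returns `None` if the kernel or a seccomp policy refuses the `unshare`.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # from linux/sched.h
CAP_SYS_ADMIN = 21          # from linux/capability.h


def capeff_in_new_userns():
    """Fork, unshare a user namespace in the child, return its CapEff as int.

    Returns None if the user namespace could not be created.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: never return into the caller's code, always _exit.
        try:
            os.close(r)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) == 0:
                with open("/proc/self/status") as f:
                    for line in f:
                        if line.startswith("CapEff:"):
                            os.write(w, line.split()[1].encode())
        finally:
            os._exit(0)
    os.close(w)
    with os.fdopen(r) as f:
        data = f.read()
    os.waitpid(pid, 0)
    return int(data, 16) if data else None


if __name__ == "__main__":
    mask = capeff_in_new_userns()
    if mask is None:
        print("could not create a user namespace here")
    else:
        has_admin = bool(mask & (1 << CAP_SYS_ADMIN))
        print(f"uid on host: {os.getuid()}, CAP_SYS_ADMIN in new userns: {has_admin}")
```

No daemon and no host root are involved: the parent stays an ordinary user, and the capabilities the child gains apply only to resources governed by its new namespace.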

Anyway, let's step back.

What is a security boundary? I would say it is a mechanism that restricts an attacker such that the attacker must exploit a vulnerability in order to get around the restriction. By that measure, containers are a boundary. Is exploitation difficult? Not necessarily; like I said, the Linux kernel has loads of attack surface. But it meets a reasonable criterion for a boundary.

As an example, chroot on its own is not a boundary because an attacker with root inside it can just call chroot again. This requires no vulnerability, it will never be patched, and you need another layer to prevent it. Containers have nothing like that: there is no "just let me out" syscall, so you need a vulnerability to get out.

You can read more about user namespaces here:

https://www.man7.org/linux/man-pages/man7/user_namespaces.7....