Hacker News new | ask | show | jobs
by TrueDuality 1535 days ago
The reason that containers are not generally considered a security boundary is that many of the namespace primitives were _not designed_ as a security layer, they aren't designed to actively reduce the privileges from the current user's context. Since most containers are started as the root user, the namespace transition inherits root's permissions even if they're later dropped. Without SELinux or seccomp restrictions, root can still pretty much do anything to the host even inside the containers.

For the most part this is troublesome when parts of the kernel or host userspace code are not fully aware of the different forms of namespacing (there are still portions that just check for an effective UID of 0, without checking whether they're in a namespace for example). These are the components where a lot of container breakouts happen and is largely mitigated by having internal processes in the container not running as root in the namespace. Dropping privileges to a different user still trace's it origin back to the root user on the host, so in some cases being partially aware of namespaces in a section of the kernel or host user code actively hurts the security by tracing the user back to root and using those privileges again. SELinux really tightens the potential to pull these shenanigans, but most production k8s clusters at least that I've seen are built on Ubuntu where those protections aren't available. In this case the security layer is once again SELinux not the namespacing.

As long as the container runtime is performing the various namespace isolation primitives starting from the root user these container bypasses are going to be a risk. There are 'rootless' versions of containers which can only use the privileges available to lower (presumably heavily restricted) user but those aren't widely used. Once again this is relying on the security protections of the host user authorization, not on the namespaces.

The networking analogy is NAT. People treat it like a security layer as it kind-of-sort-of looks like an ingress firewall since you can't directly address devices inside a NAT, but its not and can be pierced pretty easily. NAT is not a firewall. Namespaces are not a security layer.

1 comments

> Without SELinux or seccomp restrictions, root can still pretty much do anything to the host even inside the containers.

That's not true

        Having a capability inside a user namespace permits a process to
       perform operations (that require privilege) only on resources
       governed by that namespace.  In other words, having a capability
       in a user namespace permits a process to perform privileged
       operations on resources that are governed by (nonuser) namespaces
       owned by (associated with) the user namespace (see the next
       subsection).

       On the other hand, there are many privileged operations that
       affect resources that are not associated with any namespace type,
       for example, changing the system (i.e., calendar) time (governed
       by CAP_SYS_TIME), loading a kernel module (governed by
       CAP_SYS_MODULE), and creating a device (governed by CAP_MKNOD).
       Only a process with privileges in the initial user namespace can
       perform such operations.
> For the most part this is troublesome when parts of the kernel or host userspace code are not fully aware of the different forms of namespacing (there are still portions that just check for an effective UID of 0, without checking whether they're in a namespace for example).

Yes, like I said:

> But an attacker can escape by exploiting the kernel, which I think most security people would consider to be not particularly high effort.

> Dropping privileges to a different user still trace's it origin back to the root user on the host

It does not. Only if the process creating the container is root, which with unprivileged user namespaces is not (necessarily) the case.

> The NS_GET_OWNER_UID ioctl(2) operation can be used to discover the user ID of the owner of the namespace; see ioctl_ns(2).

"root" isn't the point anyways, it's about checking capabilities. The problem is that the Linux kernel has historically not cared about root -> kernel privesc, and containers expose more attack surface because of that. But an attacker outside of a container can still just enter a namespace (user namespaces are unprivileged) and perform the same exact privesc, so containers aren't making anything worse.

> As long as the container runtime is performing the various namespace isolation primitives starting from the root user these container bypasses are going to be a risk. There are 'rootless' versions of containers which can only use the privileges available to lower (presumably heavily restricted) user but those aren't widely used.

That's not how namespaces work. Even with 'rootless' containers your guest has CAP_SYS_ADMIN. The only difference is that the daemon that starts the container isn't privileged because user namespaces are increasingly becoming unprivileged. Rootless changes nothing, except that attacks against the daemon itself won't be an insta-privesc to root on the host, they'll only be a privesc to the user running the daemon on the host.

Anyway, let's step back.

What is a security boundary? I would say it is a mechanism by which an attacker is restricted where the attacker must exploit a vulnerability in order to get around that restriction. By that measure, containers are a boundary. Is exploitation difficult? Not necessarily, like I said, the Linux kernel has loads of attack surface. But it meets a reasonable criteria for a boundary.

As an example, chroot on its own is not a boundary because attackers can just call chroot again - this requires no vulnerability, it will never be patched, and you need another layer to prevent that. Containers have nothing like that, there is no "just let me out" syscall, you require another vulnerability.

You can read more about user namespaces here:

https://www.man7.org/linux/man-pages/man7/user_namespaces.7....