by csmpltn 1535 days ago
> "Exactly. containers are not secure sandboxes by default and if one is breached all those K8s networking ACLs are worthless."

Your suggestion being? Putting a sandbox inside a sandbox? How many layers deep should this be, before being considered "secure"?

Most serious security teams do not consider containers a security boundary. So it’s not a sandbox inside a sandbox, it’s just a sandbox.

gVisor and Firecracker are the most popular sandboxes for containerized workloads.

I think this is outdated. Docker is a security boundary. By default there is no built-in way to get out of a Docker container just by asking (though if you mount the Docker socket into the container, escaping is trivial).
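
As a rough illustration of the socket caveat, here is a minimal Python sketch that checks, from inside a container, whether something that looks like the Docker socket is visible. The path `/var/run/docker.sock` is only the conventional default and the function name is made up for this example; a daemon can listen elsewhere, so a negative result proves little.

```python
import os
import stat

# Conventional default location of the Docker daemon's UNIX socket.
# This is an assumption about the host setup, not a guarantee.
DEFAULT_DOCKER_SOCKET = "/var/run/docker.sock"


def docker_socket_exposed(path: str = DEFAULT_DOCKER_SOCKET) -> bool:
    """Return True if `path` exists and is a UNIX domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, or no permission to stat it: treat as not exposed.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    print("docker socket visible:", docker_socket_exposed())
```

If this prints `True` inside a container, anything in that container that can talk to the socket can ask the daemon to start new containers, which is the "trivial" case mentioned above.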

How good of a boundary it is may be another story. There are some seccomp filters going on and namespacing is pretty sweet too.
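
If you want to see whether a seccomp filter is actually applied to a process, the `Seccomp:` field in `/proc/<pid>/status` reports it (0 = disabled, 1 = strict, 2 = filter). A minimal Python sketch, with hypothetical function names, that parses that field:

```python
from typing import Optional

# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> Optional[str]:
    """Extract the seccomp mode from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" is a separate field on newer kernels; skip it.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return None  # field not present


def current_seccomp_mode() -> Optional[str]:
    """Seccomp mode of the calling process (Linux only)."""
    with open("/proc/self/status") as f:
        return seccomp_mode(f.read())
```

Run inside a container, `current_seccomp_mode()` returning `"filter"` suggests a profile is loaded; it says nothing about how restrictive that profile is.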

But an attacker can escape by exploiting the kernel, which I think most security people would consider to be not particularly high effort.

So, suitable for internal services that you generally trust, not suitable for hostile code or highly exposed services. In an ideal world maybe we'd all use Firecracker but it's not nearly as easy to do that vs just putting something in a container.

The reason that containers are not generally considered a security boundary is that many of the namespace primitives were _not designed_ as a security layer; they aren't designed to actively reduce the privileges from the current user's context. Since most containers are started as the root user, the namespace transition inherits root's permissions even if they're later dropped. Without SELinux or seccomp restrictions, root can still pretty much do anything to the host even inside the containers.

For the most part this is troublesome when parts of the kernel or host userspace code are not fully aware of the different forms of namespacing (there are still portions that just check for an effective UID of 0, without checking whether they're in a namespace for example). These are the components where a lot of container breakouts happen, and the risk is largely mitigated by not running the container's internal processes as root in the namespace. Dropping privileges to a different user still traces its origin back to the root user on the host, so in some cases being partially aware of namespaces in a section of the kernel or host user code actively hurts security by tracing the user back to root and using those privileges again. SELinux really tightens the potential to pull these shenanigans, but most production k8s clusters, at least those I've seen, are built on Ubuntu where those protections aren't available. In this case the security layer is once again SELinux, not the namespacing.

As long as the container runtime is performing the various namespace isolation primitives starting from the root user these container bypasses are going to be a risk. There are 'rootless' versions of containers which can only use the privileges available to a lower (presumably heavily restricted) user but those aren't widely used. Once again this is relying on the security protections of the host user authorization, not on the namespaces.

The networking analogy is NAT. People treat it like a security layer as it kind-of-sort-of looks like an ingress firewall since you can't directly address devices inside a NAT, but it's not and can be pierced pretty easily. NAT is not a firewall. Namespaces are not a security layer.

> Without SELinux or seccomp restrictions, root can still pretty much do anything to the host even inside the containers.

That's not true

       Having a capability inside a user namespace permits a process to
       perform operations (that require privilege) only on resources
       governed by that namespace.  In other words, having a capability
       in a user namespace permits a process to perform privileged
       operations on resources that are governed by (nonuser) namespaces
       owned by (associated with) the user namespace (see the next
       subsection).

       On the other hand, there are many privileged operations that
       affect resources that are not associated with any namespace type,
       for example, changing the system (i.e., calendar) time (governed
       by CAP_SYS_TIME), loading a kernel module (governed by
       CAP_SYS_MODULE), and creating a device (governed by CAP_MKNOD).
       Only a process with privileges in the initial user namespace can
       perform such operations.
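
To make the capability names in that excerpt concrete, here is a small Python sketch that decodes the `CapEff:` bitmask from `/proc/<pid>/status` for the capabilities discussed in this thread. The helper names are hypothetical. Keep in mind the point of the man page text: the mask is relative to the process's own user namespace, so a set bit inside a non-initial user namespace does not by itself allow the host-wide operations.

```python
# Bit positions from <linux/capability.h> for the capabilities named above.
CAP_BITS = {
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_TIME": 25,
    "CAP_MKNOD": 27,
}


def effective_caps(status_text: str) -> int:
    """Return the CapEff bitmask from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The kernel prints the mask as a hexadecimal string.
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def held(mask: int) -> dict:
    """Map each capability name in CAP_BITS to whether its bit is set."""
    return {name: bool((mask >> bit) & 1) for name, bit in CAP_BITS.items()}


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        print(held(effective_caps(f.read())))
```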

> For the most part this is troublesome when parts of the kernel or host userspace code are not fully aware of the different forms of namespacing (there are still portions that just check for an effective UID of 0, without checking whether they're in a namespace for example).

Yes, like I said:

> But an attacker can escape by exploiting the kernel, which I think most security people would consider to be not particularly high effort.

> Dropping privileges to a different user still traces its origin back to the root user on the host

It does not, unless the process creating the container is root, which with unprivileged user namespaces is not (necessarily) the case.

> The NS_GET_OWNER_UID ioctl(2) operation can be used to discover the user ID of the owner of the namespace; see ioctl_ns(2).
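
Another way to look at the same question is `/proc/<pid>/uid_map`, which lists how UIDs inside a user namespace map to UIDs outside it (three fields per line: first ID inside, first ID outside, range length). A minimal Python sketch with hypothetical helper names; note that the "outside" value is reported relative to the process reading the file, so treat the result as a hint rather than proof:

```python
from typing import List, Optional, Tuple

UidMap = List[Tuple[int, int, int]]  # (inside_start, outside_start, length)


def parse_uid_map(text: str) -> UidMap:
    """Parse the contents of a /proc/<pid>/uid_map file."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, length = (int(x) for x in fields)
            entries.append((inside, outside, length))
    return entries


def outside_uid(entries: UidMap, inside_uid: int = 0) -> Optional[int]:
    """Translate a UID inside the namespace to the UID it maps to outside."""
    for inside, outside, length in entries:
        if inside <= inside_uid < inside + length:
            return outside + (inside_uid - inside)
    return None  # unmapped


if __name__ == "__main__":
    with open("/proc/self/uid_map") as f:
        print("uid 0 maps to:", outside_uid(parse_uid_map(f.read())))
```

In the initial namespace the map is typically `0 0 4294967295`, so root maps to root. In a container started by an unprivileged user you would expect UID 0 to map to some non-zero UID instead.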

"root" isn't the point anyways, it's about checking capabilities. The problem is that the Linux kernel has historically not cared about root -> kernel privesc, and containers expose more attack surface because of that. But an attacker outside of a container can still just enter a namespace (user namespaces are unprivileged) and perform the same exact privesc, so containers aren't making anything worse.

> As long as the container runtime is performing the various namespace isolation primitives starting from the root user these container bypasses are going to be a risk. There are 'rootless' versions of containers which can only use the privileges available to a lower (presumably heavily restricted) user but those aren't widely used.

That's not how namespaces work. Even with 'rootless' containers your guest has CAP_SYS_ADMIN. The only difference is that the daemon that starts the container isn't privileged because user namespaces are increasingly becoming unprivileged. Rootless changes nothing, except that attacks against the daemon itself won't be an insta-privesc to root on the host, they'll only be a privesc to the user running the daemon on the host.

Anyway, let's step back.

What is a security boundary? I would say it is a mechanism that restricts an attacker such that the attacker must exploit a vulnerability in order to get around that restriction. By that measure, containers are a boundary. Is exploitation difficult? Not necessarily; like I said, the Linux kernel has loads of attack surface. But it meets a reasonable criterion for a boundary.

As an example, chroot on its own is not a boundary because attackers can just call chroot again - this requires no vulnerability, it will never be patched, and you need another layer to prevent that. Containers have nothing like that, there is no "just let me out" syscall, you require another vulnerability.

You can read more about user namespaces here:

https://www.man7.org/linux/man-pages/man7/user_namespaces.7....

Using the Dirty Pipe Vulnerability to Break Out from Containers

https://www.datadoghq.com/blog/engineering/dirty-pipe-contai...

Yes? There are a million exploits that allow breaking out of a container. I didn't say it was some impenetrable force field, I said it was a security boundary.

I don't think this is the bottomless pit that you think it is. A virtualised instance is a lot more secure than a container, and it's probably fine to stop at virtualised instances.

A lot more secure? In what ways?

Containers are really a kind of process-isolation - you still share a kernel. You can find a lot of people saying that containers aren’t enough for running untrusted user code.

If you run a fully virtualised instance you get your own kernel and aren’t relying on process isolation.

Would you be happy if your cloud provider was running your containers on the same virtual instance as someone else’s? Most people wouldn’t be.

The only meaningful difference between breaking out of a process-isolated "container" and a full-blown VM is what's waiting for you outside once you've broken out. Whether it's kernel/OS or a bare metal hypervisor isn't really all that meaningful: exploits and vulnerabilities exist for either.

There should be proper hardware-level isolation here, depending on the scenario. Most cloud companies can't afford that though, because they're not rolling out their own hardware.

> Whether it's kernel/OS or a bare metal hypervisor isn't really all that meaningful: exploits and vulnerabilities exist for either.

This is just not true, or at least it's extremely disingenuous.

Container isolation relies on the Linux kernel. Other than seccomp-denied syscalls (which aren't a thing in k8s by default) any program in the container has full access to the kernel. The Linux kernel has massive attack surface, especially to root users.
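
For what it's worth, a pod can opt in to the runtime's default seccomp profile and drop capabilities through its `securityContext`. Below is a minimal Python sketch that adds those fields to a pod manifest represented as a dict. The field names (`seccompProfile`, `runAsNonRoot`, `allowPrivilegeEscalation`, `capabilities.drop`) are Kubernetes pod spec fields; the helper name, pod name and image are made up for illustration.

```python
import copy
import json


def harden_pod(pod: dict) -> dict:
    """Return a copy of a pod manifest with stricter securityContext settings."""
    hardened = copy.deepcopy(pod)  # leave the caller's manifest untouched
    spec = hardened.setdefault("spec", {})

    pod_ctx = spec.setdefault("securityContext", {})
    # Ask the container runtime to apply its default seccomp filter.
    pod_ctx["seccompProfile"] = {"type": "RuntimeDefault"}
    # Refuse to start containers whose image would run as UID 0.
    pod_ctx["runAsNonRoot"] = True

    for container in spec.get("containers", []):
        ctx = container.setdefault("securityContext", {})
        ctx["allowPrivilegeEscalation"] = False
        ctx["capabilities"] = {"drop": ["ALL"]}
    return hardened


if __name__ == "__main__":
    # Hypothetical pod; name and image are placeholders.
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "example"},
        "spec": {"containers": [{"name": "app", "image": "example.invalid/app:latest"}]},
    }
    print(json.dumps(harden_pod(pod), indent=2))
```

This narrows the syscall and capability surface the container can reach, but the workload is still talking to the shared host kernel, which is the point being made here.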

VM isolation like Firecracker is much safer. The attack surface is considerably lower. For one thing, you can isolate the process in the guest just as well as you could outside, further limiting attack surface. But more importantly, an attacker either has to attack:

1. Firecracker

2. KVM

Both are very small codebases.

Firecracker is:

1. Written in Rust.

2. Sandboxed aggressively.

KVM has basically never had a public guest-to-host breakout. You can read about one here: https://googleprojectzero.blogspot.com/2021/06/an-epyc-escap...

So, to recap, we have "security boundary relies on a fully exposed Linux kernel" and "security boundary relies on hardened, tiny, security-driven programs".

It is not even close.

> There should be proper hardware-level isolation here, depending on the scenario. Most cloud companies can't afford that though, because they're not rolling out their own hardware.

Hence hardware building hypervisor support in.

Genuinely, would you be happy with just container isolation between you and other customers of your cloud provider?

Most people absolutely would not.

> "Genuinely, would you be happy with just container isolation between you and other customers of your cloud provider? Most people absolutely would not."

But that's exactly how VPS hosting works today - you don't get your own private blade unless you're ready to pay premium prices and have the competence needed to run them yourself. The technicalities of how private resources in a VPS are isolated from each other will differ, but the concept remains the same nonetheless.

People bite the bullet, only to be subject to things like rowhammer [1], or other container escape scenarios [2].

The top comment in this thread reflects the proper way of dealing with this: containers or sandboxes may not be treated as a secure boundary.

[1] https://www.usenix.org/conference/usenixsecurity16/technical...

[2] https://www.intezer.com/blog/research/how-we-escaped-docker-...

So I'll start by saying that security is always relative and what's ok for one environment won't be for another :)

The challenge with Linux containers as used by Docker/containerd/CRI-O et al. is that containers run against a shared Linux kernel. The Linux kernel has a very large attack surface, so it's easier for attackers to find some way to bypass the restrictions it tries to enforce. This year alone there have been several Local Privilege Escalation issues in the Linux kernel, some of which have allowed for container breakout.

If you compare this to a hardened hypervisor (e.g. Firecracker) there is a much smaller attack surface visible from inside the container. It obviously could have a breakout vuln. but there is a lower chance of that occurring.

Developers working with docker are almost always in the 'docker' group on their local machine, which is functionally equivalent to running everything as root.

This doesn't matter if the attacker is in the container. It just means that if the attacker is outside of the container they have a trivial privesc to root on the host.
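
If you want to audit that on a host, a minimal Python sketch that lists the members of a group from `/etc/group`-formatted text could look like this. The function name is hypothetical, and it only sees supplementary members listed in that file; users whose primary group is `docker`, or who come from another directory source, will not show up.

```python
from typing import List


def group_members(etc_group_text: str, group: str = "docker") -> List[str]:
    """List members of `group` from text in /etc/group format.

    Each line has four colon-separated fields:
    name:password:gid:comma,separated,members
    """
    for line in etc_group_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) == 4 and fields[0] == group:
            return [member for member in fields[3].split(",") if member]
    return []  # group not present


if __name__ == "__main__":
    with open("/etc/group") as f:
        print("docker group members:", group_members(f.read()))
```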

Opposite - don't mess with sandboxing. Use PaaS services like it's > 2008, and let AWS / Google security teams harden their platform.