by chrisseaton 1535 days ago
I don't think this is the bottomless pit that you think it is. A virtualised instance is a lot more secure than a container, and it's probably fine to stop at virtualised instances.

A lot more secure? In what ways?
Containers are really a kind of process-isolation - you still share a kernel. You can find a lot of people saying that containers aren’t enough for running untrusted user code.

If you run a fully virtualised instance you get your own kernel and aren’t relying on process isolation.

Would you be happy if your cloud provider was running your containers on the same virtual instance as someone else’s? Most people wouldn’t be.
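
To make the shared-kernel point concrete, here is a minimal sketch (the function name is illustrative, not from any tool mentioned in this thread). Run on a Linux host and then inside a container on that host, it should report the same kernel release, because the container has no kernel of its own. Inside a virtualised instance it reports the guest's kernel instead.

```python
import platform


def kernel_release() -> str:
    """Return the release string of the kernel this process is running on."""
    return platform.release()


if __name__ == "__main__":
    # Compare this output on the host, in a container, and in a VM guest.
    print(kernel_release())
```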

The only meaningful difference between breaking out of a process-isolated "container" and a full-blown VM is what's waiting for you outside once you've broken out. Whether it's kernel/OS or a bare metal hypervisor isn't really all that meaningful: exploits and vulnerabilities exist for either.

There should be proper hardware-level isolation here, depending on the scenario. Most cloud companies can't afford that though, because they're not rolling out their own hardware.

> Whether it's kernel/OS or a bare metal hypervisor isn't really all that meaningful: exploits and vulnerabilities exist for either.

This is just not true, or at least it's extremely disingenuous.

Container isolation relies on the Linux kernel. Other than seccomp-denied syscalls (which aren't a thing in k8s by default) any program in the container has full access to the kernel. The Linux kernel has massive attack surface, especially to root users.
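
As a rough way to check whether a seccomp filter is in effect at all, a process can read the `Seccomp:` field that Linux exposes in `/proc/self/status` (0 is disabled, 1 is strict mode, 2 is filter mode). The sketch below assumes a Linux system with `/proc` mounted; the helper names are made up for illustration. Running it inside a pod would show whether the runtime applied any profile.

```python
from typing import Optional

# Meaning of the "Seccomp:" field in /proc/<pid>/status on Linux.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_seccomp_mode(status_text: str) -> Optional[int]:
    """Extract the seccomp mode from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" (newer kernels) does not match this prefix.
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1].strip())
    return None


def current_seccomp_mode() -> Optional[int]:
    """Return this process's seccomp mode, or None if it cannot be read."""
    try:
        with open("/proc/self/status") as f:
            return parse_seccomp_mode(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print(SECCOMP_MODES.get(current_seccomp_mode(), "unknown"))
```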

VM isolation like Firecracker is much safer. The attack surface is considerably lower. For one thing, you can isolate the process in the guest just as well as you could outside, further limiting attack surface. But more importantly, an attacker either has to attack:

1. Firecracker

2. KVM

Both are very small codebases.

Firecracker is:

1. Written in Rust.

2. Sandboxed aggressively.

KVM has basically never had a public guest-to-host breakout. You can read about one here: https://googleprojectzero.blogspot.com/2021/06/an-epyc-escap...

So, to recap, we have "security boundary relies on a fully exposed Linux kernel" and "security boundary relies on hardened, tiny, security-driven programs".

It is not even close.

> There should be proper hardware-level isolation here, depending on the scenario. Most cloud companies can't afford that though, because they're not rolling out their own hardware.

Hence hardware building hypervisor support in.

Genuinely, would you be happy with just container isolation between you and other customers of your cloud provider?

Most people absolutely would not.

> "Genuinely, would you be happy with just container isolation between you and other customers of your cloud provider? Most people absolutely would not."

But that's exactly how VPS hosting works today - you don't get your own private blade unless you're ready to pay premium prices and have the competence needed to run them yourself. The technicalities of how private resources in a VPS are isolated from each other will differ, but the concept remains the same nonetheless.

People bite the bullet, only to be subject to things like rowhammer [1], or other container escape scenarios [2].

The top comment in this thread reflects the proper way of dealing with this: containers or sandboxes may not be treated as a secure boundary.

[1] https://www.usenix.org/conference/usenixsecurity16/technical...

[2] https://www.intezer.com/blog/research/how-we-escaped-docker-...

No, VPS hosting is not usually container-based today once you leave the utter bargain-bin offers. The difference between VM isolation and container isolation is quite significant.

> But that's exactly how VPS hosting works today

No, VPS is isolation by virtualisation, not containerisation.

The clue is in the V in the name.

So I'll start by saying that security is always relative and what's ok for one environment won't be for another :)

The challenge with Linux containers as used by Docker/Containerd/CRI-O et al, is that containers run against a shared Linux kernel. The Linux kernel has a very large attack surface, so it's easier for attackers to find some way to bypass the restrictions it tries to enforce. If you look at this year there have been several Local Privilege Escalation issues in the Linux Kernel, some of which have allowed for container breakout.

If you compare this to a hardened hypervisor (e.g. Firecracker) there is a much smaller attack surface visible from inside the guest. It obviously could have a breakout vulnerability, but there is a lower chance of that occurring.

Developers working with docker are almost always in the 'docker' group on their local machine, which is functionally equivalent to running everything as root.
This doesn't matter if the attacker is in the container. It just means that if the attacker is outside of the container they have a trivial privesc to root on the host.
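
For anyone wanting to check their own machine, a small sketch of the group test described above. It assumes the conventional group name `docker` (a distribution may use a different one), and the function names are hypothetical.

```python
import grp
import os
from typing import Iterable, List


def grants_docker_access(group_names: Iterable[str]) -> bool:
    """True if the given groups include 'docker' (assumed default group name)."""
    return "docker" in set(group_names)


def current_group_names() -> List[str]:
    """Names of the groups the current process belongs to (Unix only)."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # The gid has no entry in the group database; skip it.
            pass
    return names


if __name__ == "__main__":
    if grants_docker_access(current_group_names()):
        print("In the docker group: treat this account as root-equivalent.")
    else:
        print("Not in the docker group.")
```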