by xxpor 1571 days ago
Back in the day, people insisted that containers were not security boundaries and should not be treated as such. They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary. So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?

The only other design I'm familiar with that sort of comes close is MicroVMs. Those have the downside of actually needing to run a VM though, and most (all?) cloud providers don't allow nested virtualization so you're stuck running on an enormous bare metal box.

13 comments

I don't think the industry is moving towards deepening dependence on container/jail interfaces for multitenant workloads --- virtualization has gotten incredibly cheap. So these issues are mostly problems for internal data center segregation and blast radius reduction. It's not nothing, they're important security problems, but unless you're doing something dubious, they shouldn't be existentially important.

There are AWS and GCP instance types with nested virtualization that'll let you run Firecracker. Digital Ocean apparently supports it everywhere.

Slightly pedantic: ec2 doesn't actually support nested virtualization on any instance type I know of, but does have baremetal instance types that support virtualization.

The reason I mention this is because, sadly, baremetal instance types are only ever the largest size of a given family which is cost prohibitive for most users. And even if cost isn't an issue, they take much much longer to start (like 10-20+ minutes) and they actually fail to start far too frequently. It's really a shame that all instance types other than baremetal have virtualization extensions disabled, otherwise we'd be operating far more workloads in firecracker or kata. We operate huge kubernetes clusters so the cost is roughly the same whether it's fewer big instances or more smaller instances, but those startup times and reliability are terrible for autoscaling.

Please, AWS, bring nested virtualization to all nitro instance types!
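For anyone wanting to check whether a given instance can host Firecracker or kata at all, here's a rough sketch. It assumes an x86 Linux guest, where hardware virtualization shows up as the `vmx` (Intel) or `svm` (AMD) flag in /proc/cpuinfo and KVM is exposed as /dev/kvm. The function names are made up for illustration, and this is a quick heuristic, not a definitive test.

```python
import os


def virt_flags(cpuinfo_text):
    """Return the hardware virtualization flags (vmx/svm) found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels list CPU features on lines whose key is "flags".
        if key.strip() == "flags":
            found |= {"vmx", "svm"} & set(value.split())
    return found


def can_run_microvms(cpuinfo_path="/proc/cpuinfo", kvm_path="/dev/kvm"):
    """Heuristic: virtualization extensions are visible and /dev/kvm is usable."""
    try:
        with open(cpuinfo_path) as f:
            flags = virt_flags(f.read())
    except OSError:
        flags = set()
    # os.access returns False if the device node does not exist.
    return bool(flags) and os.access(kvm_path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    print("microVM-capable:", can_run_microvms())
```

On an instance type with the extensions masked off, `virt_flags` comes back empty and the check fails before /dev/kvm is even consulted.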

You can run https://gvisor.dev/ without any virtualization requirement. We use this to host user-submitted configurations (not arbitrary code, but arbitrary input to ~mostly trusted code).

Does this not meet your requirements?

gvisor is awesome and works for particularly untrusted applications, but it's not a performance hit we'd be willing to take across the board and effectively only protects you from security bugs rather than other kernel issues. We run thousands of production database workloads, hundreds of load balancers, thousands of web apps, ML jobs, batch processing, etc. in kubernetes, most of which require as much performance as possible.

When an EBS volume for a pod goes impaired, if it's using xfs you can basically count the whole server as dead no matter how many xfs + block io timeouts you set. xfs will stop being able to mount/unmount any other filesystems once it's hung in an unmount call for one. With a proper VM, you'd pass the nvme device through with pcie passthrough and the host would be totally unimpacted.

Also, gvisor's better mode requires kvm, but it's cool that it effectively functions with ptrace when you can't use kvm.

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?

Yes, because systems that are designed with these kinds of security boundaries in mind already look like containers -- they're a natural match to actual capability-based systems like, for example, plan9's.

The problem here stems entirely from trying to keep these globally-overriding capabilities like CAP_SYS_ADMIN and CAP_DAC_OVERRIDE while also allowing users to create their own namespaces. All these CVEs weren't things as long as only root could create new userns', and now that normal users can, all these areas where things weren't checked are coming out of the woodwork.

But a ground up capability-based system avoids this kind of problem by simply making it impossible to elevate to a privilege level like 'root' on POSIX systems, and so namespacing within those systems is incredibly natural to the point that it didn't really get a name (containers) until one was needed for linux' cognitive dissonance around the idea.

You're confusing capabilities systems. Linux capabilities are not "capabilities", they're a misnomer. They're just groupings of privileges.

Here is what capabilities are.

https://en.wikipedia.org/wiki/Capability-based_security

I don't think what you're advocating for makes a ton of sense tbh. You're basically saying "just make it impossible to privesc", which, yeah, that would be nice... but it's not like you can just do that.

I think your point is more that least privilege should be more common - that way exploits have less impact. I agree. That said, Linux Capabilities are extremely coarse, and most container escapes involve owning the Kernel, which from a real Capabilities model would be the trusted broker of capabilities to begin with.
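To illustrate the "groupings of privileges" point: a Linux process's capabilities are just bits in a mask, visible as the hex `CapEff:` field in /proc/<pid>/status. A minimal sketch of decoding it follows. The bit numbers for the capabilities shown come from linux/capability.h, the table is deliberately partial, and the helper names are made up for illustration.

```python
# Partial table of capability bit numbers (from linux/capability.h).
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(hex_mask):
    """Return names of the known capabilities set in a hex mask such as CapEff."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAP_NAMES.items()) if (mask >> bit) & 1]


def effective_caps(status_text):
    """Find the CapEff line in /proc/<pid>/status text and decode it."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        print(effective_caps(f.read()))
```

Nothing in the mask names an object or a resource, which is the contrast with object-capability systems: each bit switches on a whole class of kernel checks at once.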

I am not accusing linux of having a real capability system, so nope I'm not confusing them at all. I'm honestly not sure where you got me saying that it does, my tweet is a criticism of linux (or really POSIX) and its lack of true capabilities.

Also, I used plan9 as an example for a reason. The kernel is quite hands off about capabilities in general in plan9, and is definitely not the primary source of trust in the system beyond the fact that a kernel is always a central trust node (some userspace processes like factotum and the authentication server do the real work and hold secure information).

There are systems out there that "just make it impossible to privesc", so it is possible. It's just not really possible within POSIX, because POSIX is built around it.

OK, I apologize - that was my misunderstanding, and I should have worded it as "I think you're confusing" rather than accusatory. I wouldn't hold it against anyone to do so - the naming collision is unfortunate and has been a source of confusion for as long as it has existed.

Oh yeah it is absolutely confusing, and I think it's done real harm to the concept to have it misused in linux so badly.

What do you think about Fuchsia? It's fully capability-based: https://fuchsia.dev/fuchsia-src/concepts/components/v2/capab...

I'm not sure a year has gone by without a vulnerability that breaks shared-kernel isolation in reasonable configurations. Nobody was going to DAC or MAC out `waitid`, but `waitid` for a time would take a kernel address for its siginfo_t parameter.

I didn't mean to imply that there'd never been any kind of "container escape" vuln before userns creation was opened, just that the "create userns, escape with magic privs" kind was new and largely because of that change.

(I do think the change will be a net good in the long run, because rootless docker is probably a net improvement, but I think maybe it would have also been a good opportunity to reconsider how they inherit these global capabilities)

It's not a binary thing. I would say something is a boundary if it requires an additional vulnerability to bypass. Containers these days fit that model. The nuance is how strong of a boundary it is.

Containers rely on the Linux kernel. The Linux kernel is shit, in terms of security, for a number of reasons. So all one requires is to own the kernel, and there are a lot of ways to do that. Containers block some system calls and can lower attack surface to a degree, which is great - I think it's a huge win that containers are so popular and, finally, some degree of isolation is widespread.
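To make "containers block some system calls" concrete: runtimes like Docker apply a seccomp profile that lists syscalls and an action for each, with a default action for everything else. Below is a toy model of that lookup, assuming a profile shaped like Docker's JSON format (`defaultAction`, `syscalls[].names`, `syscalls[].action`). The profile contents and the `action_for` helper are made up for illustration, and real profiles also filter on arguments and architecture, which this ignores.

```python
# Hypothetical allowlist profile: deny by default, allow a handful of syscalls.
EXAMPLE_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"},
    ],
}


def action_for(profile, syscall):
    """Return the action a (simplified) profile assigns to a syscall name."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # Anything not listed falls through to the profile's default action.
    return profile["defaultAction"]


if __name__ == "__main__":
    for name in ("read", "mount", "waitid"):
        print(name, "->", action_for(EXAMPLE_PROFILE, name))
```

The catch is the one the comment describes: every syscall that does get allowed still runs in the shared kernel, so a bug in any allowed path is enough.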

We'll be stuck in retroactive security mode until developers care to change that, especially ones with influence like kernel maintainers.

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

Absolutely not. We'd have ended up with something like Firecracker or GVisor. The issues with containers are fundamental to the concept of having a shared Linux kernel, which is basically what makes a container a container.

> If not, is there any realistic way to go from where we are to where we should be?

Use Firecracker or GVisor.

> Those have the downside of actually needing to run a VM though

I think at this point VMs are not that big of a deal. It's clearly good enough for the vast majority of people who are running on the cloud.

> don't allow nested virtualization so you're stuck running on an enormous bare metal box.

This part is a bummer.

The other option though is to just not care if your OS gets owned. Split your services up, move capabilities across other boundaries like mTLS.

gvisor doesn't require nested virtualization, right? If you're willing to take a tenable user-mode-Linux performance hit, you should be able to run it on anything?

My understanding is that gvisor supports two modes of execution - one with virtualization and one without. AFAIK the official recommendation is to use the one with virtualization, but I've never dug into it.

Yeah, the original mode uses ptrace to intercept system calls, and then just implements the system call itself.

I'll quote Theo de Raadt here. He was talking about virtualization, but I would guess the same could be said of containers:

> You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes.

Who was he referring to?

No one in particular. He's saying there are no perfect developers so no hypervisors will ever be perfectly secure.

Which is a silly statement, because for all X, no X will ever be perfectly secure. That's why we have multiple layers available and containers and VMs are just one of them.

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

It might have looked like FreeBSD jails or Illumos / Solaris Zones. Both of which are containers designed as a security boundary from the start.

I'm here to push back on the fabled security powers of ground-up security-focused shared-kernel isolation. People love to bring up Zones and Jails in these conversations, presumably since both are much more coherent designs than Linux namespaces, MAC, BPF and cgroups, which are now comparably (if not more) featureful, but shambolic and hard to reason about. But none of these systems are sufficient for multitenant isolation. It would not be OK to rely on Zones for a major multitenant compute workload.

> But none of these systems are sufficient for multitenant isolation. It would not be OK to rely on Zones for a major multitenant compute workload.

You can definitely run hostile workloads securely in zones next to each other. Joyent ran a public cloud on zones and there are still smaller cloud providers who do.

In the Sun Solaris days zones were even certified for a bunch of high profile security certifications (if you care about such things).

And Joyent had problems doing that:

https://news.ycombinator.com/item?id=27078349

There's nothing you can do to "certify" zones to mitigate this. The problem is that zone cotenants share a kernel. You have to trust that the kernel attack surface is free of LPEs, and no reasonable person can trust that.

I don't see how bugs of zone escapes and such are necessarily proof of the concept not working.

Chrome also has had its fair share of sandbox escapes and zero-click remote code execution exploits. Does that mean you can't have a browser? I mean, by those standards, if even Google can't get it right, us "mere mortal developers" might as well quit altogether.

> The problem is that zone cotenants share a kernel.

Even with a "hardware" VM they share a kernel (it's just called a hypervisor). And while they share that kernel to a lesser extent, there are also VM escapes. The VMware and KVM security advisories are a testimony to that.

The Chrome sandbox would also be problematic for these workloads, for similar reasons! The point of isolated kernels is to foreclose on whole large classes of vulnerabilities. The problem of shared-kernel isolation is that you opt into them.

In the status quo ante of Firecracker, there were colorable arguments that hypervisors had comparably large attack surfaces to containers and jails and zones. But that's mostly out the window now: you can write a mostly memory-safe hypervisor and give it a tiny attack surface by providing only minimal support for virtio devices --- the big challenge with legacy hypervisor stacks is that they were designed to support things like desktop Windows, rather than being scoped down to serverside Linux.

Or HP-UX vaults grown out of Tru64.

> Back in the day, people insisted that containers were not security boundaries and should not be treated as such. They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

> However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary.

It has to be secure. Browsers are using pretty much the same technologies (seccomp-bpf, cgroups, namespaces, etc) to tightly sandbox JavaScript from websites. Browsers run wildly untrusted code from all over the web, and are expected to handle many forms of malware without letting any of it escape the sandbox.

If containers can't be made secure, we have bigger problems.

> So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

No! Linux and Unix APIs are a mess of patchworks. They are pretty much insecure by default, with rare exceptions.

We could make a new platform with a saner API and make it run on top of Linux, and write new backend services targeting it. I think WASI may just be that. The only problem is that wasm has some overhead and doesn't have access to all CPU features.

I think Unikernel VMs are the future. Build your app into one blob with no user/kernel space boundary that runs in a guest VM. No boot time or wasted memory/latency (context switch) issues.

That said, even VMs are best-effort security boundaries, so apparmor/selinux type restrictions put in place on the host should be the main hard security boundary IMO.

Good luck debugging that.

Shouldn't need luck. It wouldn't use qemu or vmware but a specialized VM manager that interfaces with the unikernel via network/virtual hardware and exposes a virtual file system to it. E.g., the app calls "read()", but instead of glibc wrapping a syscall, a compiled-in wrapper asks the hypervisor to "read()". Except it would just memcpy() around a file opened at virtual boot, instead of asking the kernel to read a file and then send the data back. That avoids the context switch: just send the request and wait for the interrupt.

File system, networking and security need to be abstracted in a way that is ideal for performance and introspection, specifically for a unikernel built to interface with the abstraction.

> They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

I disagree with that idea. The actual threat may be as limited in capabilities as a standard bug. Let's say you have a problem with your webapp where you can read an arbitrary file, but nothing else. Containers are a perfect protection in this case if you want to isolate the app from any other services running on the host (monitoring, provisioning, etc.).

There's no perfection and defence in depth is what we need to use everywhere. Unless you can break through all layers at the same time, imperfect layers are a valid improvement. See how many default protections you have to turn off to even make this bug viable.

At least some of the Azure series support nested virtualization. See https://docs.microsoft.com/en-us/azure/virtual-machines/dv4-.... There are a lot of series and I don’t know the breakdown but I would expect dsv4 to be one of the more widely used options because it is for generic CPU workloads.

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

Yes. The only difference is that the Linux-based systems and tools, as opposed to Zones or Jails, were the first to pivot to a developer-focused view rather than that of the sysadmin. This utility is why containers gained critical mass, not because the security-focused foundations of other implementations were an impediment.

When developer productivity comes before sysadmins, that is when security goes south, as history has proven on desktop systems.

With Spectre we discovered that not even VMs are adequate security boundaries.

My opinion: want security? Separate (bare metal) machines. Period.

...and in the spirit of the parent comment, Intel didn't intend for protected mode to be a security boundary either. The 286 and 386 programming manuals referred to the protections as a form of reducing the severity of bugs.

How are they not a security boundary? Nearly everything is a security boundary using defense in depth, no?

Security boundaries in Linux are UIDs/GIDs, capabilities, SELinux domains, and others. These can be applied to processes regardless of whether the process runs in a container.

i.e. root inside a container is root on the host; the container itself doesn't help that. But other security features, that are applied to the processes within the container when the container is created, might.
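One way to see this for yourself: "root inside a container is root on the host" holds when the container has no user namespace remapping UIDs, and the kernel exposes the mapping in /proc/<pid>/uid_map as lines of `inside-start outside-start length`. Here's a minimal sketch of reading it. The helper names are made up for illustration, and the 100000 offset in the example is just a typical rootless-style value, not something from this thread.

```python
def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            ranges.append(tuple(int(p) for p in parts))
    return ranges


def host_uid(inside_uid, ranges):
    """Translate a UID seen inside the namespace to the UID outside it."""
    for inside_start, outside_start, length in ranges:
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None  # unmapped in this namespace


if __name__ == "__main__":
    with open("/proc/self/uid_map") as f:
        print("uid 0 maps to host uid", host_uid(0, parse_uid_map(f.read())))
```

With the identity map `0 0 4294967295`, UID 0 inside is UID 0 outside. With a remap such as `0 100000 65536`, root in the container is an unprivileged UID on the host, which is one of those "other security features" applied at container creation.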