by stormbrew 1571 days ago
> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?

Yes, because systems that are designed with these kinds of security boundaries in mind already look like containers -- they're a natural match to actual capability-based systems like, for example, plan9's.

The problem here stems entirely from trying to keep these globally-overriding capabilities like CAP_SYS_ADMIN and CAP_DAC_OVERRIDE while also allowing users to create their own namespaces. None of these CVEs were an issue as long as only root could create new user namespaces, and now that normal users can, all these areas where things weren't checked are coming out of the woodwork.
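To make that concrete, here is a toy model of the check involved. It is only a sketch: the names `UserNS`, `Task`, `unshare_userns`, `buggy_global_op` and `correct_global_op` are made up, and the real kernel logic has more rules (namespace ownership, for one) than this shows. The idea is that a process gets a full capability set inside a user namespace it creates, and that is only safe if every check guarding a global resource asks about the initial namespace rather than the caller's own.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class UserNS:
    """A user namespace; parent is None only for the initial namespace."""
    parent: Optional["UserNS"] = None


INIT_NS = UserNS()


@dataclass
class Task:
    userns: UserNS
    caps: set = field(default_factory=set)


def ns_capable(task: Task, target_ns: UserNS, cap: str) -> bool:
    """Simplified: task has cap over target_ns if it holds cap in its own
    namespace and target_ns is that namespace or a descendant of it."""
    ns = target_ns
    while ns is not None:
        if ns is task.userns:
            return cap in task.caps
        ns = ns.parent
    return False


def capable(task: Task, cap: str) -> bool:
    """Check against the initial namespace, i.e. 'real' root privileges."""
    return ns_capable(task, INIT_NS, cap)


def unshare_userns(task: Task) -> Task:
    """Model an unprivileged user creating a namespace: full caps inside it."""
    child_ns = UserNS(parent=task.userns)
    return Task(userns=child_ns, caps={"CAP_SYS_ADMIN", "CAP_DAC_OVERRIDE"})


def buggy_global_op(task: Task) -> bool:
    # Hypothetical bug: guards a global resource with a per-namespace check.
    return ns_capable(task, task.userns, "CAP_SYS_ADMIN")


def correct_global_op(task: Task) -> bool:
    return capable(task, "CAP_SYS_ADMIN")


user = Task(userns=INIT_NS)       # ordinary user, no capabilities
contained = unshare_userns(user)  # "root" inside its own namespace
```

In this model `contained` passes `buggy_global_op` but not `correct_global_op`, which is roughly the shape of the "create userns, escape with magic privs" class of bug.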

But a ground-up capability-based system avoids this kind of problem by simply making it impossible to elevate to a privilege level like 'root' on POSIX systems, so namespacing within those systems is incredibly natural, to the point that it didn't really get a name (containers) until one was needed for Linux's cognitive dissonance around the idea.

3 comments

You're confusing capabilities systems. Linux capabilities are not "capabilities", they're a misnomer. They're just groupings of privileges.

Here is what capabilities are.

https://en.wikipedia.org/wiki/Capability-based_security
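A rough sketch of the difference, as I understand it (toy code, not any real system's API; `read_ambient`, `ReadCap` and `word_count` are invented names, and Python itself does not enforce any of this): in the privilege-check style, any code can name any resource and a check decides based on who is asking. In the capability style, holding the reference is the authority, and code that was never handed a reference has nothing to ask for.

```python
# Toy global state that any code can name by path.
FILES = {"/etc/shadow": "root:secret", "/tmp/notes": "hello"}
PRIVILEGED_USERS = {"root"}


def read_ambient(user: str, path: str) -> str:
    """Privilege-check style: the caller names a path, identity is checked."""
    if path.startswith("/etc/") and user not in PRIVILEGED_USERS:
        raise PermissionError(path)
    return FILES[path]


class ReadCap:
    """Capability style: a handle that grants read access to one file."""

    def __init__(self, path: str):
        self._content = FILES[path]

    def read(self) -> str:
        return self._content


def word_count(cap: ReadCap) -> int:
    # By convention this function only touches what it was handed; in a
    # real capability system it would have no way to name anything else.
    return len(cap.read().split())


notes_cap = ReadCap("/tmp/notes")  # minted by whoever legitimately has access
count = word_count(notes_cap)
```

With `read_ambient`, becoming "root" unlocks everything at once. With `ReadCap`, there is no identity to escalate to, only references that were or were not passed along.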

I don't think what you're advocating for makes a ton of sense tbh. You're basically saying "just make it impossible to privesc", which, yeah, that would be nice... but it's not like you can just do that.

I think your point is more that least privilege should be more common - that way exploits have less impact. I agree. That said, Linux capabilities are extremely coarse, and most container escapes involve owning the kernel, which in a real capabilities model would be the trusted broker of capabilities to begin with.

I am not accusing linux of having a real capability system, so nope I'm not confusing them at all. I'm honestly not sure where you got me saying that it does, my tweet is a criticism of linux (or really POSIX) and its lack of true capabilities.

Also, I used plan9 as an example for a reason. The kernel is quite hands off about capabilities in general in plan9, and is definitely not the primary source of trust in the system beyond the fact that a kernel is always a central trust node (some userspace processes like factotum and the authentication server do the real work and hold secure information).

There are systems out there that "just make it impossible to privesc", so it is possible. It's just not really possible within POSIX, because POSIX is built around it.

OK, I apologize - that was my misunderstanding, and I should have worded it as "I think you're confusing" rather than accusatory. I wouldn't hold it against anyone to do so - the naming collision is unfortunate and has been a source of confusion for as long as it has existed.
Oh yeah it is absolutely confusing, and I think it's done real harm to the concept to have it misused in linux so badly.
What do you think about Fuchsia? It's fully capability-based: https://fuchsia.dev/fuchsia-src/concepts/components/v2/capab...
I'm not sure a year has gone by without a vulnerability that breaks shared-kernel isolation in reasonable configurations. Nobody was going to DAC or MAC out `waitid`, but for a time `waitid` would take a kernel address for its siginfo_t parameter.
I didn't mean to imply that there'd never been any kind of "container escape" vuln before userns creation was opened, just the "create userns, escape with magic privs" kind was new and largely because of that change.

(I do think the change will be a net good in the long run, because rootless docker is probably a net improvement, but I think maybe it would have also been a good opportunity to reconsider how they inherit these global capabilities)