by cyphar 32 days ago
1. The privilege check in question here is capable(CAP_NET_ADMIN), so it doesn't work in user namespaces.

2. Most sandboxes (including Docker and Podman) disable creating unprivileged user namespaces inside them via seccomp. In this mode, you end up with a more secure setup than requiring a privileged process to spawn containers (for one, it massively reduces the risk of confused deputy attacks against container runtimes). You can also restrict it with ucounts (as rough of a system as that is).

3. The kernel provides this facility and the feature was added back in early 2013 (before Docker existed and long before they added user namespace support, let alone rootless containers), so I don't understand why you think this is somehow the fault of OCI? We're just making something useful out of existing kernel infrastructure. Folks have asked the kernel to provide a knob to disable unprivileged user namespaces but the maintainer has refused to do so for years (the best you get is ucounts and seccomp). I would also prefer to have such a knob (or even adding a separate ucount with configurable per-user limits) but it's not up to me.
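
A rough way to see point 1 from user space: capable() checks a capability against the initial user namespace, so "root" inside an unprivileged user namespace still fails it even though its CapEff mask looks full. The sketch below is only a heuristic under stated assumptions: it treats a single identity mapping over the full 32-bit range in /proc/self/uid_map as "initial user namespace", and the function names are made up for illustration.

```python
CAP_NET_ADMIN = 12  # bit number from include/uapi/linux/capability.h


def effective_caps(status_text: str) -> int:
    """Parse the CapEff bitmask out of the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def in_initial_userns(uid_map_text: str) -> bool:
    """Heuristic (assumption): the initial user namespace shows exactly one
    identity mapping that covers the full 32-bit uid range."""
    rows = [line.split() for line in uid_map_text.splitlines() if line.strip()]
    return rows == [["0", "0", "4294967295"]]


def likely_passes_capable(status_text: str, uid_map_text: str,
                          cap: int = CAP_NET_ADMIN) -> bool:
    """Approximate capable(cap): the bit must be effective AND the process
    must live in the initial user namespace."""
    has_cap = bool((effective_caps(status_text) >> cap) & 1)
    return has_cap and in_initial_userns(uid_map_text)


def check_current_process() -> bool:
    """Convenience wrapper for a live Linux system."""
    with open("/proc/self/status") as f:
        status = f.read()
    with open("/proc/self/uid_map") as f:
        uid_map = f.read()
    return likely_passes_capable(status, uid_map)
```

Calling check_current_process() as real root on the host should return True, while the same call after entering an unprivileged user namespace should return False, since the uid_map no longer looks like the initial one.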

(Disclaimer: I implemented rootless containers for runc back in the day and work on OCI, so I do have some bias here.)
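
For the ucounts restriction mentioned in point 2, the relevant limit is exposed as the user.max_user_namespaces sysctl. A minimal sketch for inspecting it, assuming a Linux system with /proc mounted (the helper names here are illustrative, not part of any tool):

```python
from pathlib import Path
from typing import Optional

# Sysctl file backing user.max_user_namespaces (a ucount limit).
USERNS_LIMIT = Path("/proc/sys/user/max_user_namespaces")


def read_userns_limit(path: Path = USERNS_LIMIT) -> Optional[int]:
    """Return the configured limit, or None if it cannot be read
    (for example on a kernel without user namespace support)."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


def describe_limit(limit: Optional[int]) -> str:
    """Turn the raw limit into a short human-readable summary."""
    if limit is None:
        return "unknown: limit not readable on this system"
    if limit == 0:
        return "disabled: no new user namespaces can be created here"
    return f"allowed: ucount limit is {limit}"


def report() -> str:
    """Convenience wrapper for a live system."""
    return describe_limit(read_userns_limit())
```

Setting the sysctl to 0 is the blunt version of the "knob" discussed above. It applies to the whole user namespace it is set in rather than per user, which is part of why it is a rough tool.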

1 comments

1) The various projects refused even simple requests, like allowing the admin to disable the --privileged flag, in the rootful days.

2) The choice to break out CRI with zero authorization or mutation at the CRI level, while understandable given the containerd team's needs, exposed every other runtime to an unprotected alternative communication path.

3) The OCI group's refusal to provide guidance to LSM maintainers on minimal configurations, while also handing responsibility for seccomp profiles to end users, means only actively attacked vectors are protected, and it becomes impossible for normal users to operate safely.

4) Under the UNIX model, it is the caller of clone/fork/unshare that must drop privileges.

5) This model was set in concrete by the OCI standards and now suffers from the frozen caveman pattern.

The capable()[0] kernel function operates as one would expect for granting superior capabilities, and while the work to expand the isolation is something I am sure you are familiar with, you probably also realize that the number of entries in a default user also expanded just to support user namespaces.

But to be clear, while the choices that Docker/OCI made are understandable from a local greedy choice perspective, they complicate the entire user space.

K8s mutating admission controllers are a symptom of those choices.

Had the CRI contained a bounding set, enforced at the system level, especially with guidance and tools for users to start from a minimal set that they could easily expand on, we would be in a better spot.
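
To make the idea concrete, here is a purely hypothetical sketch of what such a check could look like. Nothing like this exists in the CRI today; the names (MINIMAL_DEFAULT, resolve_capabilities, CapabilityDenied) and the choice of default capabilities are assumptions for illustration only.

```python
from typing import FrozenSet, Iterable

# Hypothetical minimal default; a real policy would be chosen by the admin.
MINIMAL_DEFAULT: FrozenSet[str] = frozenset(
    {"CAP_CHOWN", "CAP_SETUID", "CAP_SETGID"}
)


class CapabilityDenied(Exception):
    """Raised when a request asks for capabilities outside the bounding set."""


def resolve_capabilities(requested: Iterable[str],
                         bounding: Iterable[str],
                         default: FrozenSet[str] = MINIMAL_DEFAULT
                         ) -> FrozenSet[str]:
    """Start from a minimal default, add what the user asked for, and
    refuse anything the system-wide bounding set does not allow."""
    bounding_set = frozenset(bounding)
    requested_set = frozenset(requested)
    denied = requested_set - bounding_set
    if denied:
        raise CapabilityDenied(", ".join(sorted(denied)))
    # Defaults are clamped silently; explicit requests fail loudly.
    return (default & bounding_set) | requested_set
```

The design choice being illustrated is that the bounding set is enforced below the caller, so a request like --privileged cannot widen it, while users can still expand from the minimal default up to whatever the admin allows.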

But since other projects cannot provide meaningful protections that are not simply bypassed by calling privileged CRIs, that is also a barrier to convincing them to do the same.

Really there is a larger problem that OCI could be the leader on, but they are the ‘killer app’ and refuse to do so.

The bounding set for user capabilities is driven by containers, and while namespaces are not and never have been a security feature, this blocks their ability to have a strong security posture.

To be clear, expecting every end user to write minimal seccomp profiles is unrealistic, especially when docker prevents devs from accessing the local machine to discover what is happening. I think podman is the only machine that allows that by default.

Basically, while simplifying moby/containerd/CRI is an understandable choice, the refusal to address the costs of that local optimum has fallout.

[0] https://elixir.bootlin.com/linux/v7.0.5/source/kernel/capabi...