Hacker News
by alphazard 441 days ago
There seems to be a fundamental mismatch between how sane people think about sandboxing, and how linux manages namespaces.

A linux-naive developer would expect to spawn a new process from a payload with access to nothing. It can't see other processes, it has a read only root with nothing in it, there are no network devices, no users, etc. Then they would expect to read documentation to learn how to add things to the sandbox. They want to pass in a directory, or a network interface, or some users. The effort goes into adding resources to the sandbox, not taking them away.

Instead there is this elaborate ceremony where the principal process basically spawns another version of itself endowed with all the same privileges and then gives them up, hopefully leaving itself with only the stuff it wants the sandboxed process to have. Make sure you don't forget to revoke anything.
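
A minimal sketch of that subtractive ceremony, for illustration only: fork a copy of the current process, then have the copy call unshare(2) to give up its view of the parent's namespaces. The helper name `try_unshare_in_child` is hypothetical, the flag values are the ones from `<linux/sched.h>`, and whether the call succeeds depends on kernel configuration and privileges, so the function just reports the child's errno.

```python
import ctypes
import os

# Namespace flags as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUSER = 0x10000000  # user namespace
CLONE_NEWPID = 0x20000000   # PID namespace (applies to later children)
CLONE_NEWNET = 0x40000000   # network namespace


def try_unshare_in_child(flags: int) -> int:
    """Fork, call unshare(flags) in the child, return 0 or the child's errno.

    Hypothetical helper: the child starts as a copy of the parent and then
    tries to drop what it inherited, which is the pattern described above.
    """
    pid = os.fork()
    if pid == 0:
        code = 255  # reported if anything unexpected happens in the child
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            code = 0 if libc.unshare(flags) == 0 else ctypes.get_errno()
        finally:
            # Never fall back into the parent's code path from the child.
            os._exit(code & 0xFF)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

Calling `try_unshare_in_child(CLONE_NEWUSER | CLONE_NEWNET)` returns 0 where unprivileged user namespaces are allowed and an errno such as EPERM where they are not. Note that everything not named in `flags` (open file descriptors, the mount table, environment) is still inherited, which is the "don't forget to revoke anything" problem.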

5 comments

> a read only root with nothing in it

A lot of things break if there's no /proc/self. A lot more things break if the terminfo database is absent. More things break if there's no timezone database. Finally, almost everything breaks if the root file system has no libc.so.6.

When you write Dockerfiles, you can easily do it FROM scratch. You can then easily observe whether the thing you are sandboxing actually works.

> no users

Now you are breaking something as fundamental as getuid.

The modern statically linked languages (I'm thinking of Go and Zig specifically) increasingly need less and less of the cruft you mentioned. Hopefully, that trend continues.

> no users

I mean running as root. I think all processes on Linux have to have a user id. Anything inside a sandbox should start with all the permissions for that environment. If the sandbox process wants to muck around with the users/groups authorization model then it can create those resources inside the sandbox.

The things that break in C if /proc/self or the terminfo DB are missing will break in Go and Zig too.

What I think you might mean is something like: "in modern statically linked applications written with languages like Go and Zig, it is much less likely for them to call on OS services that require these sorts of resources".

That is pretty much what jails are in FreeBSD, especially thin jails.
Or capabilities. Additive security has been known for decades; Linux really dropped the ball here. Linux file descriptors (open file descriptions, whatever) are close to a genuine capability model, except there's plenty of leakage where you can get at the insecure base.
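
To make the file descriptor point concrete, here is a small sketch, assuming a Linux system. A directory fd is used as the only grant, files are opened relative to it with `dir_fd`, and then `..` is opened through the same fd to show one way the grant leaks. The names `open_relative` and `demo` are hypothetical, and this is only one example of the kind of leakage the comment may have in mind.

```python
import os
import tempfile


def open_relative(dir_fd: int, name: str) -> int:
    """Open `name` read-only relative to dir_fd (an openat-style call)."""
    return os.open(name, os.O_RDONLY, dir_fd=dir_fd)


def demo() -> tuple[bytes, bool]:
    """Return the granted file's contents and whether '..' escaped the grant."""
    with tempfile.TemporaryDirectory() as base:
        granted = os.path.join(base, "granted")
        os.mkdir(granted)
        with open(os.path.join(granted, "hello.txt"), "wb") as f:
            f.write(b"hi")

        # The directory fd plays the role of the capability.
        dir_fd = os.open(granted, os.O_RDONLY | os.O_DIRECTORY)
        try:
            fd = open_relative(dir_fd, "hello.txt")
            try:
                data = os.read(fd, 16)
            finally:
                os.close(fd)

            # Leak: plain relative lookup happily walks out through "..".
            parent_fd = open_relative(dir_fd, "..")
            try:
                escaped = os.path.samestat(os.fstat(parent_fd), os.stat(base))
            finally:
                os.close(parent_fd)
        finally:
            os.close(dir_fd)
    return data, escaped
```

Here `demo()` reads the file through the directory fd, and the second value reports that opening `..` through the same fd reached the parent directory, so holding the fd alone does not confine the holder.
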
> Instead there is this elaborate ceremony where the principal process basically spawns another version of itself endowed with all the same privileges and then gives them up

The flags to unshare are copies of clone3 args, so you're actually free to do this. There's some song and dance though, because it's not actually possible to exec an arbitrary binary with access to nothing.

But I think the big discrepancy is the expectation that "spawn a new process with a new executable" is a single operation. It doesn't work that way; it is inherently a two-step process. You clone3/fork into a new child process, inheriting what you will from the parent based on the clone args/flags (which could be everything, could be nothing), do some setup work, and then exec.
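
The fork, setup, exec sequence can be sketched roughly as follows. This uses plain `os.fork` rather than clone3 (Python does not wrap clone3 directly), the helper name `spawn_with_setup` is hypothetical, and the setup step shown is only an example of the kind of work that happens between the two calls.

```python
import os


def spawn_with_setup(argv: list[str], env: dict[str, str]) -> int:
    """fork, do setup work in the child, exec argv[0]; return the exit code."""
    pid = os.fork()
    if pid == 0:
        try:
            # Setup step: runs in the child, still as a copy of the parent.
            os.chdir("/")             # drop the inherited working directory
            os.closerange(3, 256)     # sketch: close inherited fds 3..255
            # Exec step: replace the copy with the new executable,
            # passing only the environment we chose to hand over.
            os.execve(argv[0], argv, env)
        finally:
            # Only reached if setup or exec failed.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

For example, `spawn_with_setup(["/bin/sh", "-c", "exit 7"], {})` returns 7, and the shell sees none of the parent's environment variables because `env` is empty. Anything the setup step does not explicitly drop stays inherited.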

> There seems to be a fundamental mismatch between how sane people think about sandboxing, and how linux manages namespaces.

What bothers me most about sandboxing with Linux namespaces is that edge cases keep turning up that let a sandboxed process trick the kernel into granting more privileges than it should.

I wonder if Landlock can/will bring something more like FreeBSD jails to the table. (I haven't made time to read about it in detail yet.)

This is why I would still rather isolate using QEMU, Docker, or VirtualBox rather than a very thin chroot-like environment.
Docker uses namespaces by default. Are you using an add-on that makes it use a hypervisor instead?
I believe this is because on POSIX systems the only way to create a new process is fork().
There is the later added posix_spawn, which could be implemented with a system call, even if on Linux it is emulated with clone + exec.

posix_spawn can do much, but not all, of what is possible with clone + exec. Presumably the standard's editors were wary of giving it an overly complex parameter list, though that should not have been a problem if all parameters had reasonable default values.
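
For comparison, a small sketch of what posix_spawn does cover, using Python's `os.posix_spawn` wrapper. The helper name `spawn_quiet` is hypothetical. The setup work is expressed declaratively as file actions rather than as code run between fork and exec; as far as I know there is no attribute for namespace flags, which is one of the things clone + exec can do that this cannot.

```python
import os


def spawn_quiet(argv: list[str], env: dict[str, str]) -> int:
    """Spawn argv[0] with stdout sent to /dev/null; return its exit code."""
    file_actions = [
        # In the child, open /dev/null write-only as fd 1 before exec.
        (os.POSIX_SPAWN_OPEN, 1, "/dev/null", os.O_WRONLY, 0),
    ]
    pid = os.posix_spawn(argv[0], argv, env, file_actions=file_actions)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

For example, `spawn_quiet(["/bin/sh", "-c", "echo hidden; exit 5"], {})` returns 5 and prints nothing, since the child's stdout was redirected before the shell started.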