by dicroce 441 days ago
I actually wish that instead of docker & etc we had just gotten a better chroot... Or maybe just a new kernel syscall that is chroot()++.
16 comments

There seems to be a fundamental mismatch between how sane people think about sandboxing, and how linux manages namespaces.

A linux-naive developer would expect to spawn a new process from a payload with access to nothing. It can't see other processes, it has a read only root with nothing in it, there are no network devices, no users, etc. Then they would expect to read documentation to learn how to add things to the sandbox. They want to pass in a directory, or a network interface, or some users. The effort goes into adding resources to the sandbox, not taking them away.

Instead there is this elaborate ceremony where the principal process basically spawns another version of itself endowed with all the same privileges and then gives them up, hopefully leaving itself with only the stuff it wants the sandboxed process to have. Make sure you don't forget to revoke anything.

> a read only root with nothing in it

A lot of things break if there's no /proc/self. A lot more things break if the terminfo database is absent. More things break if there's no timezone database. Finally, almost everything breaks if the root file system has no libc.so.6.

When you write Dockerfiles, you can easily do it FROM scratch. You can then easily observe whether the thing you are sandboxing actually works.

> no users

Now you are breaking something as fundamental as getuid.

The modern statically linked languages (I'm thinking of Go and Zig specifically) increasingly need less and less of the cruft you mentioned. Hopefully, that trend continues.

> no users

I mean running as root. I think all processes on Linux have to have a user id. Anything inside a sandbox should start with all the permissions for that environment. If the sandbox process wants to muck around with the users/groups authorization model then it can create those resources inside the sandbox.

The things that break in C if /proc/self or the terminfo DB are missing will break in Go and Zig too.

What I think you might mean is something like: "in modern statically linked applications written with languages like Go and Zig, it is much less likely for them to call on OS services that require these sorts of resources".

That is pretty much what jails are in FreeBSD, especially thin jails.
Or capabilities. Additive security has been known for decades; Linux really dropped the ball here. Linux file descriptors (open file descriptions, whatever) are close to a genuine capability model, except there's plenty of leakage where you can get at the insecure base.
> Instead there is this elaborate ceremony where the principal process basically spawns another version of itself endowed with all the same privileges and then gives them up

The flags to unshare are copies of clone3 args, so you're actually free to do this. There's some song and dance though, because it's not actually possible to exec an arbitrary binary with access to nothing.

But I think the big discrepancy is that there is inherently a two step process to "spawn a new process with a new executable." Doesn't work that way - you clone3/fork into a new child process, inheriting what you will from the parent based on the clone args/flags (which could be everything, could be nothing), do some setup work, and then exec.
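
The fork, setup, exec sequence can be sketched like this. It uses plain `os.fork()` rather than clone3 (Python has no direct clone3 wrapper), and `spawn_with_setup` is a made-up helper name; the "setup" here is just redirecting stdout and changing directory, but this is the window where namespace or privilege changes would go:

```python
import os
import sys


def spawn_with_setup(argv, env, cwd="/"):
    """fork, do setup in the child, then exec. Returns (exit_code, stdout_bytes)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: still running a copy of the parent's program image here.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)            # setup: stdout -> pipe
            os.close(write_fd)
            os.chdir(cwd)                   # setup: choose a working directory
            os.execve(argv[0], argv, env)   # replace the image; returns only on error
        finally:
            os._exit(127)                   # reached only if setup or exec failed
    # Parent: read the child's output, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output
```

The child only gets the environment passed in `env`; everything else it inherits (open descriptors that are not close-on-exec, credentials, namespaces) is whatever the parent had at fork time unless the setup step changes it.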

> There seems to be a fundamental mismatch between how sane people think about sandboxing, and how linux manages namespaces.

What bothers me most about sandboxing with linux namespaces is that edge cases keep turning up that allow them to trick the kernel into granting more privileges than it should.

I wonder if Landlock can/will bring something more like FreeBSD jails to the table. (I haven't made time to read about it in detail yet.)

This is why I would still rather isolate using QEMU, docker, or VirtualBox rather than a very thin chroot-like environment
Docker uses namespaces by default. Are you using an add-on that makes it use a hypervisor instead?
I believe this is because on POSIX systems the only way to create a new process is fork().
There is the later added posix_spawn, which could be implemented with a system call, even if on Linux it is emulated with clone + exec.

posix_spawn can do much, but not all, of what is possible with clone + exec. Presumably the standard editors have been scared to add too complex function parameters for its invocation, though that should not have been a problem if all parameters had reasonable default values.
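
To illustrate the "much, but not all" point: a rough sketch using Python's `os.posix_spawn` wrapper (`spawn_capture` is a made-up helper name). File descriptor redirection is expressible as a file action, but there is no hook for arbitrary code between process creation and exec, so something like unsharing a namespace in the child cannot be requested this way:

```python
import os
import sys


def spawn_capture(argv, env):
    """Run argv via posix_spawn with stdout piped back. Returns (exit_code, bytes)."""
    read_fd, write_fd = os.pipe()
    # The only "setup" available is declarative: here, dup2 the pipe onto fd 1.
    # Python creates pipe fds close-on-exec, so the originals vanish at exec.
    file_actions = [(os.POSIX_SPAWN_DUP2, write_fd, 1)]
    pid = os.posix_spawn(argv[0], argv, env, file_actions=file_actions)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output
```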

Come to FreeBSD, we have just that - jails.
yup! FreeBSD jails are essentially what OP wants with chroot++.

I was pretty puzzled when Docker and LXC came around as this whole new thing believed to have "never been done before"; FreeBSD had supported a very similar concept for years before security groups were added in Linux.

Jails and ezjail were stellar to make mini no-overhead containers when running various services on a server. Being able to archive them and expand them on a new machine was also pretty cool (as long as the BSD version was the same.)

> this whole new thing believed to have "never been done before";

Nobody with knowledge of sandboxing believed this, Virtuozzo and later OpenVZ had been on Linux for a long time after all. Virtuozzo was even from a similar time frame as FreeBSD jails (2000-ish).

The key innovation of Docker was to provide a standardized way to build, distribute, and run container images.

Virsh had worked for a long time before docker came around, but yeah… you essentially had to build your own Docker-like infrastructure that only you were using
Solaris Zones too. Absolute magic, many years before Docker and friends.
We kind of did but it's all put in the context of containers. Check out the unshare command.

unshare --mount

Most examples you'll find put it in the context of containers, like https://www.redhat.com/en/blog/mount-namespaces
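
A small sketch of driving that from a script, assuming util-linux `unshare` is on the PATH. The helper names are invented; `--mount` and `--map-root-user` are real unshare flags, and whether this works without root depends on the distro allowing unprivileged user namespaces:

```python
import subprocess


def unshare_mount_argv(script):
    """argv that runs a shell script in a new mount namespace.

    --map-root-user also creates a user namespace and maps the caller to root
    inside it, which is what allows mount(8) to work without real root.
    """
    return ["unshare", "--mount", "--map-root-user", "sh", "-c", script]


def run_in_private_mount_ns(script):
    """Run the script; mounts it makes disappear when the shell exits."""
    return subprocess.run(unshare_mount_argv(script), capture_output=True, text=True)
```

Something like `run_in_private_mount_ns("mount -t tmpfs tmpfs /mnt && touch /mnt/x && ls /mnt")` mounts a tmpfs that only the child can see, and there is nothing to unmount afterwards.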

Apples and oranges.

Among many other things, Docker (and Podman etc) has

1. Images and OverlayFS

2. Networking

3. User namespace mappings

4. Resource management

---

If all you want is file system isolation, then docker (and postman, etc) is massive overkill; chroot is correct.

*podman, etc
Plan9 had a proper solution for this. New processes don't get access to any files by default - you have to explicitly mount directories for them, capability style.

Shame Plan9 blew its weirdness budget.

It has nothing to do with weirdness; Unix itself was plenty weird for its time. The relevant difference between Unix™ and Plan 9 is that Unix source code was given away (or cheaply licensed) to hardware companies which all wrote their own operating systems on top (SunOS, Ultrix, HP-UX, etc. etc.). This made Unix the common factor of very many commercial workstation environments. Plan 9? It was sold directly as a commercial product, for no hardware platform in particular. Nobody wanted to buy it.

People liked Unix because it was free – either really free, via BSD, or as a Unix derivative provided at no cost when people bought their workstations. A new revolutionary operating system had absolutely no reason for anybody to buy it: No commercial developers wanted to develop to a platform without users, and no users wanted a platform without software.

Plan 9 only changed their license many years later, when it was too late for anybody to care, and Unix had become the established standard.

rfork it's easy :D
Pids and cgroups all the way down (also why the wise greybeards rejected docker)
As a number of comments have noted, there are a bunch of different axes that chroot could be 'better' on - e.g. security and sandboxing.

I wrote https://github.com/aidanhs/machroot (initially forked from bubblewrap) a while ago to lean into the pure "pretend I see another filesystem" aspect of chroot with additional conveniences (so no security focus). For example, it allows setting up overlay filesystems, allows mounting squashfs filesystems with an overlay on top...and because it uses a mount namespace, means you don't need to tear down the mount points - just exit the command and you're done.

The codebase is pretty small so I just tweaked it with whatever features I needed at the time, rather than try and make it a fully fledged tool.

(honestly you can probably replicate most of it with a shell script that invokes unshare and appropriate mount commands)

systemd-nspawn is probably what you want.
I built https://github.com/jrz/container-shell as a simple chroot on docker. I use it for both development and sandboxing.

Besides that I have a simple script that starts an ephemeral docker with debian-full + tools.

I also have some scripts that leverage macOS's 'sandbox-exec'

Isn't LXC more or less an unsupervised chroot in an isolated process?
Yes. And so is bubblewrap - if security through isolation is a priority.
Working on proot-docker, a bash script on top of skopeo and proot:

https://github.com/mtseet/proot-docker

We need more people to improve it!

How do you imagine a "better chroot" would differ from a Docker container? Seems to me like that's exactly what Docker is.
ironically docker never gave you true network isolation because there's no way to make it user friendly. plus the many exploits on the all powerful daemon.

but most of the professional world uses systemd to bootstrap isolated processes nowadays, which is kinda what you are hinting at. cgroups2 and namespaces are what you want.

We should get a better change root yes, and the LD_LIBRARY_PATH gets autoupdated with respect to the new root. A few flags here and there to set permissions of the child process and we're off to the races.

Oh wow, wow. That all sounded so intensely complex, incomprehensible. What we are going to need to do is build a program to handle all that, highly formalized. Let's make it so formalized it's one of those things like taxes or AWS where people can just make a living from understanding the beast. It can be like systemd meets multics meets java. have its own various complicated commands, complicated file formats, and so on. The chroot() is only historically understood by everyone, so let's steal a page from the java playbook and just rename everything with our own terminology. The product will be so outstanding, wow, I call it "Shocker"

what would "better chroot" do?
I'm not GP, but if I were to hazard a guess, they want something more than just mount space isolation. Something akin to BSD jails, without the bells and whistles of OCI containers like overlay filesystem, network virtualization, resource management, etc.

That requirement is pretty legitimate, since it's easier and suitable enough for many applications for which we currently use OCI containers. For example, isolated builds, development environments, sandboxes etc. (I have an isolated build tool for Gentoo).

But Linux already has multiple solutions that fit the bill, like systemd-nspawn, LXC, bubblewrap, etc. Too bad, they aren't as widely known as chroot.

None of those things do what chroot does but many of them involve chroot - so I'm still not grasping what "better chroot" is, other than "not chroot, but something completely different."

It sounds like people want "better exec"

One annoying part of using chroot if you're creating them on the fly is teardown - you have to manually invoke umount, and also take care to get this right for partially created chroots (maybe you detected an error after mounting proc, in the process of getting other files in place).

This was my original motivation in creating machroot (mentioned elsewhere in this thread) and having it use namespaces.
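
For anyone stuck with plain chroot and no namespaces, the partial-teardown problem can at least be contained with a cleanup stack. This is not how machroot works, just a sketch of the manual alternative; `chroot_mounts` and the injectable `run` parameter are illustrative names:

```python
import subprocess
from contextlib import ExitStack, contextmanager


@contextmanager
def chroot_mounts(root, mounts, run=subprocess.check_call):
    """Mount each (fstype, source, relative_target) under root.

    Whatever was mounted successfully is unmounted on exit, in reverse order,
    even if a later mount or the body of the with-block raises.
    """
    with ExitStack() as stack:
        for fstype, source, target in mounts:
            path = f"{root}/{target}"
            run(["mount", "-t", fstype, source, path])
            # Registered only after the mount succeeded, so a failure part-way
            # through unwinds exactly the mounts that actually exist.
            stack.callback(run, ["umount", path])
        yield
```

Typical use would be `with chroot_mounts("/srv/root", [("proc", "proc", "proc")]): ...` around whatever runs inside the chroot. Passing a fake `run` makes it easy to dry-run the ordering without root.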

what Solaris and now illumos zones do[1]

[1] https://www.usenix.org/legacy/event/lisa04/tech/full_papers/...

Docker is not using chroot in any way. Escaping chroot is only a matter of calling chdir('..') and poof, you are out of the "sandbox"