| Thank you for taking the time to reply - happy to discuss this! :) > It's fair to say that gVisor performs better isolation on syscalls, but it's also fair to say that by adding Linux user-ns and procfs & sysfs emulation, Sysbox isolates the container in ways that gVisor does not. Have a look at what gVisor actually does: https://gvisor.dev/docs/architecture_guide/security It fully implements a subset of the Linux kernel ABI in userspace, including procfs and sysfs and even memory and process management. No untrusted code ever interacts with the host kernel. Filesystem and network access goes through an IPC protocol and is handled by the gVisor processes on the host, which in turns runs inside a user namespace and a seccomp sandbox for defense in depth. This is a much, much stronger level of isolation than your approach or, arguably, even VMs (the trade-off is performance). "Sysbox isolates the container in ways that gVisor does not" just isn't true. The sysbox approach is one kernel bug away from host system compromise, same as using regular containers. Emulating procfs and sysfs and using user namespaces takes away some of the attack surface and is great defense in depth, but does not provide isolation from the host kernel. > Also, note that Sysbox is not meant to isolate workloads in multi-tenant environments (for that we think VM-based approaches are better) I've read numerous claims that sysbox is suitable for untrusted workloads, for instance in [1] and [2]. It's a nice product and certainly much, much better than running docker-in-docker using privileged containers, but given the significant remaining attack surface, this claim could put your customers at risk and should come with a big disclaimer. > While QEMU may be faster than Firecracker in some metrics in a one-instance comparison, when you start running dozens of instances per host, the overhead of the full VM (particularly memory overhead) hurts its performance (which is the reason Firecracker was designed) Firecracker was designed for memory efficiency, faster cold start times and security (by virtue of being written in a memory-safe language). It means you can run more containers per host, but the actual workload performance overhead is identical to "normal" VMs and, in some cases, even slightly higher since Firecracker lacks some of the optimization that has gone into QEMU. > Regarding gVisor performance, we didn't do a full performance comparison vs. KubeVirt, so we may stand corrected if gVisor is in fact slower than KubeVirt when running multiple instances on the same host (would appreciate any more info you may have on such a comparison, we could not find one). KubeVirt is just plain QEMU VMs using libvirt, which have been compared to gVisor quite extensively[3][4]. There's almost no overhead for memory/CPU and quite a lot of overhead for syscalls (but with big improvements recently with the introduction of VFS2 and soon LisaFS[5]). It's a classic trade-off - gVisor is more secure and efficient than QEMU, allowing a much larger number of instances to run on a host by virtue of better cooperation with the host kernel scheduler and memory management, but for raw performance, a QEMU VM always wins. > Regarding the claim that standard containers cannot run a full OS, what the table in the GH repo is indicating is that Sysbox allows you to create unprivileged containers (or pods) that can run system software such as Docker, Kubernetes, k3s, etc. with good isolation and seamlessly (no privileged container, no changes in the software inside the container, and no tricky container entrypoints). To the best of our knowledge, it's not possible to run say Kubernetes inside a regular container unless it's a privileged container with a custom entrypoint. Or inside a Firecracker VM. If you know otherwise, please let us know. Firecracker runs a full Linux kernel inside the VM, so it could always run regular Docker, Kubernetes or anything else. See [6] for a practical example. For containers, this used to be the case, but the situation improved in recent kernel releases. For podman, almost every combination works - running systemd unprivileged, running podman inside podman, or even running rootless-podman-in-rootless-podman[7] and so does Kubernetes-in-rootless-{podman,docker}[8] (requiring very recent kernel features, though - notably cgroupsv2 and unprivileged overlayfs). Running docker:dind-rootless inside unprivileged Docker containers also works, however, it requires "--security-opt seccomp=unconfined". Sysbox definitely got to that point earlier and has better usability. > - Regarding "The claim that their solution offers large security improvements over any other solution with user namespaces isn't true". Where do you see that claim? The table explicitly states that there are solutions that provide stronger isolation. Apologies, then, for misinterpreting that. [1]: https://blog.nestybox.com/2020/10/06/related-tech-comparison... [2]: https://github.com/nestybox/sysbox/issues/120#issuecomment-9... [3]: https://object-storage-ca-ymq-1.vexxhost.net/swift/v1/6e4619... [4]: https://www.scitepress.org/Papers/2021/104405/104405.pdf [5]: https://gvisor.dev/blog/2021/12/02/running-gvisor-in-product... [6]: https://github.com/innobead/kubefire [7]: https://www.redhat.com/sysadmin/podman-inside-container [8]: https://kind.sigs.k8s.io/docs/user/rootless |
> Have a look at what gVisor actually does
I am aware of what it does, though I had missed the fact that the Sentry and/or Gopher run within a user-ns (could not find this in the docs). Had also missed the fact that it does perform procfs/sysfs emulation (makes sense), so I stand corrected on that. In light of this, I'll modify the Sysbox GH table to show gVisor as having a stronger isolation rating (in fact, our Sysbox blog comparing technologies [1] did give gVisor a stronger isolation rating).
> the sysbox approach is one kernel bug away from host system compromise
All approaches are one bug away from host system compromise (gVisor, VMs, etc.), though I agree that approaches like gVisor and VMs have a reduced attack surface.
> I've read numerous claims that sysbox is suitable for untrusted workloads
It's not a black or white determination in my view. Users choose based on their environments & needs. We always make it clear to our users that VM-based approaches provide stronger isolation, per the Sysbox GH repo:
"Isolation wise, it's fair to say that Sysbox containers provide stronger isolation than regular Docker containers (by virtue of using the Linux user-namespace and light-weight OS shim), but weaker isolation than VMs (by sharing the Linux kernel among containers)."
> Firecracker runs a full Linux kernel inside the VM, so it could always run regular Docker, Kubernetes or anything else
That's good to know (thanks), though the table in the Sysbox GH repo meant to compare Sysbox against Kata + Firecracker (since Kata is a container runtime). To the best of my knowledge running Docker, K8s, k3s, etc. inside a Kata container is not easy (see [1] and [2]).
> For containers, this used to be the case, but the situation improved in recent kernel releases.
It's correct that rootless docker/podman approaches are improving as far as what workloads they can run inside containers, although they still have several limitations [3], [4].
With Sysbox, most of these limitations don't apply because the solution works at the more basic "runc" level, Sysbox itself is rootful, and it uses some of the techniques I mentioned before (user-ns, procfs & sysfs virtualization, syscall trapping, UID-shifting, etc.) to make the container resemble a "real host" while providing good isolation.
Good discussion, please let me know of any more feedback.
[1] https://github.com/kata-containers/kata-containers/issues/20... [2] https://github.com/daniel-noland/docker-in-kata [3] https://docs.docker.com/engine/security/rootless/#known-limi... [4] https://github.com/containers/podman/blob/main/rootless.md