Thanks for the feedback; I am one of the developers of Sysbox. Some answers to the above comments:
- Regarding the container isolation, Sysbox uses a combination of Linux user-namespace + partial procfs & sysfs emulation + intercepting some sensitive syscalls in the container (using seccomp-bpf). It's fair to say that gVisor performs better isolation on syscalls, but it's also fair to say that by adding Linux user-ns and procfs & sysfs emulation, Sysbox isolates the container in ways that gVisor does not. This is why we felt it was fair to put Sysbox at a similar isolation rating as gVisor, although if you view it from purely a syscall isolation perspective it's fair to say that gVisor offers better isolation. Also, note that Sysbox is not meant to isolate workloads in multi-tenant environments (for that we think VM-based approaches are better). But in single-tenant environments, Sysbox does avoid the need for privileged containers in many scenarios because it allows well-isolated containers/pods to run system workloads such as Docker and even K8s (which is why it's often used in CI infra).
- Regarding the speed rating, we gave Firecracker a higher speed rating than KubeVirt because while they both use hardware virtualization, the former runs microVMs that are highly optimized and have much less overhead than the full VMs that typically run on KubeVirt. While QEMU may be faster than Firecracker in some metrics in a one-instance comparison, when you start running dozens of instances per host, the overhead of the full VM (particularly memory overhead) hurts its performance (which is the reason Firecracker was designed).
- Regarding gVisor performance, we didn't do a full performance comparison vs. KubeVirt, so we may stand corrected if gVisor is in fact slower than KubeVirt when running multiple instances on the same host (would appreciate any more info you may have on such a comparison, we could not find one).
- Regarding the claim that standard containers cannot run a full OS, what the table in the GH repo is indicating is that Sysbox allows you to create unprivileged containers (or pods) that can run system software such as Docker, Kubernetes, k3s, etc. with good isolation and seamlessly (no privileged container, no changes in the software inside the container, and no tricky container entrypoints). To the best of our knowledge, it's not possible to run say Kubernetes inside a regular container unless it's a privileged container with a custom entrypoint. Or inside a Firecracker VM. If you know otherwise, please let us know.
- Regarding "The claim that their solution offers large security improvements over any other solution with user namespaces isn't true". Where do you see that claim? The table explicitly states that there are solutions that provide stronger isolation.
- Regarding "The isolation offered by user namespaces is still very weak and not comparable to gVisor or Firecracker". User namespaces by themselves mitigate several recent CVEs for containers, so it's a valuable feature. It may not offer VM-level isolation, but that's not what we are claiming. Furthermore, Sysbox uses the user-ns as a baseline, but adds syscall interception and procfs & sysfs emulation to further harden the isolation.
- "False marketing is a big red flag, especially for something as critical as a container runtime." That's not what we are doing.
- Rootless Docker/Podman are great, but they work at a different level than Sysbox. Sysbox is an enhanced "runc", and while Sysbox itself runs as true root on the host (i.e., Sysbox is not rootless), the containers or pods it creates are well isolated and avoid the need for privileged containers in many scenarios. This is why several companies use it in production too.
> It's fair to say that gVisor performs better isolation on syscalls, but it's also fair to say that by adding Linux user-ns and procfs & sysfs emulation, Sysbox isolates the container in ways that gVisor does not.
Have a look at what gVisor actually does: https://gvisor.dev/docs/architecture_guide/security
It fully implements a subset of the Linux kernel ABI in userspace, including procfs and sysfs and even memory and process management. No untrusted code ever interacts with the host kernel. Filesystem and network access goes through an IPC protocol and is handled by the gVisor processes on the host, which in turn run inside a user namespace and a seccomp sandbox for defense in depth.
This is a much, much stronger level of isolation than your approach or, arguably, even VMs (the trade-off is performance). "Sysbox isolates the container in ways that gVisor does not" just isn't true.
The sysbox approach is one kernel bug away from host system compromise, same as using regular containers. Emulating procfs and sysfs and using user namespaces takes away some of the attack surface and is great defense in depth, but does not provide isolation from the host kernel.
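To make the user-namespace part concrete, here is a minimal Python sketch (not taken from Sysbox or gVisor) that reads a `uid_map` in the format documented in user_namespaces(7) and reports which outside UID a given inside UID maps to. The function names are made up for illustration. Note that this only shows the ID remapping; the process still talks to the same host kernel, which is the point made above.

```python
from typing import List, Optional, Tuple


def parse_uid_map(text: str) -> List[Tuple[int, int, int]]:
    """Parse uid_map content into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return entries


def outside_uid(inside_uid: int, text: str) -> Optional[int]:
    """Return the UID in the parent namespace for inside_uid, or None if unmapped."""
    for inside_start, outside_start, length in parse_uid_map(text):
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None


def root_is_remapped(text: str) -> bool:
    """True when UID 0 in this namespace is not UID 0 outside (or is unmapped)."""
    return outside_uid(0, text) != 0


# Typical use on Linux (reads the mapping of the current process):
#   with open("/proc/self/uid_map") as f:
#       print(root_is_remapped(f.read()))
```

In the initial namespace the file usually contains a single identity mapping (`0 0 4294967295`), so `root_is_remapped` returns False there; inside a container started with a user namespace, root maps to some unprivileged range instead.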
> Also, note that Sysbox is not meant to isolate workloads in multi-tenant environments (for that we think VM-based approaches are better)
I've read numerous claims that sysbox is suitable for untrusted workloads, for instance in [1] and [2].
It's a nice product and certainly much, much better than running docker-in-docker using privileged containers, but given the significant remaining attack surface, this claim could put your customers at risk and should come with a big disclaimer.
> While QEMU may be faster than Firecracker in some metrics in a one-instance comparison, when you start running dozens of instances per host, the overhead of the full VM (particularly memory overhead) hurts its performance (which is the reason Firecracker was designed)
Firecracker was designed for memory efficiency, faster cold start times and security (by virtue of being written in a memory-safe language). It means you can run more containers per host, but the actual workload performance overhead is identical to "normal" VMs and, in some cases, even slightly higher since Firecracker lacks some of the optimization that has gone into QEMU.
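The density argument is plain arithmetic, sketched below in Python. All numbers are hypothetical and chosen only to show the shape of the calculation; they are not measurements of Firecracker, QEMU, or anything else. The sketch assumes memory is the only limiting resource and says nothing about per-workload speed.

```python
def instances_per_host(host_mem_mib: int, workload_mem_mib: int,
                       overhead_mem_mib: int) -> int:
    """How many instances fit if each needs workload memory plus a fixed overhead."""
    per_instance = workload_mem_mib + overhead_mem_mib
    if per_instance <= 0:
        raise ValueError("per-instance memory must be positive")
    return host_mem_mib // per_instance


# Hypothetical host with 64 GiB and workloads of 128 MiB each.
HOST_MIB = 64 * 1024
small_overhead = instances_per_host(HOST_MIB, 128, 5)    # 492
large_overhead = instances_per_host(HOST_MIB, 128, 128)  # 256
print(small_overhead, large_overhead)
```

A lower per-instance overhead raises the count, but each individual workload runs no faster for it.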
> Regarding gVisor performance, we didn't do a full performance comparison vs. KubeVirt, so we may stand corrected if gVisor is in fact slower than KubeVirt when running multiple instances on the same host (would appreciate any more info you may have on such a comparison, we could not find one).
KubeVirt is just plain QEMU VMs using libvirt, which have been compared to gVisor quite extensively[3][4]. gVisor adds almost no overhead for memory/CPU and quite a lot of overhead for syscalls (though with big recent improvements from the introduction of VFS2 and, soon, LisaFS[5]). It's a classic trade-off: gVisor is more secure and efficient than QEMU, allowing a much larger number of instances to run on a host by virtue of better cooperation with the host kernel scheduler and memory management, but for raw performance, a QEMU VM always wins.
> Regarding the claim that standard containers cannot run a full OS, what the table in the GH repo is indicating is that Sysbox allows you to create unprivileged containers (or pods) that can run system software such as Docker, Kubernetes, k3s, etc. with good isolation and seamlessly (no privileged container, no changes in the software inside the container, and no tricky container entrypoints). To the best of our knowledge, it's not possible to run say Kubernetes inside a regular container unless it's a privileged container with a custom entrypoint. Or inside a Firecracker VM. If you know otherwise, please let us know.
Firecracker runs a full Linux kernel inside the VM, so it could always run regular Docker, Kubernetes or anything else. See [6] for a practical example.
For containers, it used to be the case that this required a privileged container, but the situation has improved in recent kernel releases.
For Podman, almost every combination works: running systemd unprivileged, running podman inside podman, or even running rootless-podman-in-rootless-podman[7]. So does Kubernetes-in-rootless-{podman,docker}[8], though it requires very recent kernel features, notably cgroups v2 and unprivileged overlayfs.
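A rough way to check those two prerequisites from Python is sketched below. Two assumptions are baked in and should be verified for your distro: that a `cgroup.controllers` file at the cgroup root indicates a cgroup v2 unified hierarchy (a common heuristic, not an official API), and that mainline gained unprivileged overlayfs in kernel 5.11 (some distros carried patches earlier, so a version check is only a hint). The function names are hypothetical.

```python
import os
import re
from typing import Tuple


def has_cgroup_v2(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """Heuristic: the unified hierarchy exposes cgroup.controllers at its root."""
    return os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers"))


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: " + repr(release))
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release: str, minimum: Tuple[int, int]) -> bool:
    """Compare a kernel release string against a (major, minor) minimum."""
    return kernel_version(release) >= minimum


# Typical use on Linux:
#   print(has_cgroup_v2())
#   print(kernel_at_least(os.uname().release, (5, 11)))  # 5.11 is an assumption
```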
Running docker:dind-rootless inside unprivileged Docker containers also works; however, it requires "--security-opt seccomp=unconfined".
Sysbox definitely got to that point earlier and has better usability.
> - Regarding "The claim that their solution offers large security improvements over any other solution with user namespaces isn't true". Where do you see that claim? The table explicitly states that there are solutions that provide stronger isolation.
Apologies, then, for misinterpreting that.
[1]: https://blog.nestybox.com/2020/10/06/related-tech-comparison...
[2]: https://github.com/nestybox/sysbox/issues/120#issuecomment-9...
[3]: https://object-storage-ca-ymq-1.vexxhost.net/swift/v1/6e4619...
[4]: https://www.scitepress.org/Papers/2021/104405/104405.pdf
[5]: https://gvisor.dev/blog/2021/12/02/running-gvisor-in-product...
[6]: https://github.com/innobead/kubefire
[7]: https://www.redhat.com/sysadmin/podman-inside-container
[8]: https://kind.sigs.k8s.io/docs/user/rootless