
Thank you for taking the time to reply - happy to discuss this! :)

> It's fair to say that gVisor performs better isolation on syscalls, but it's also fair to say that by adding Linux user-ns and procfs & sysfs emulation, Sysbox isolates the container in ways that gVisor does not.

Have a look at what gVisor actually does: https://gvisor.dev/docs/architecture_guide/security

It fully implements a subset of the Linux kernel ABI in userspace, including procfs and sysfs and even memory and process management. No untrusted code ever interacts with the host kernel. Filesystem and network access go through an IPC protocol and are handled by the gVisor processes on the host, which in turn run inside a user namespace and a seccomp sandbox for defense in depth.

This is a much, much stronger level of isolation than your approach or, arguably, even VMs (the trade-off is performance). "Sysbox isolates the container in ways that gVisor does not" just isn't true.

The Sysbox approach is one kernel bug away from host system compromise, same as using regular containers. Emulating procfs and sysfs and using user namespaces take away some of the attack surface and are great defense in depth, but they do not provide isolation from the host kernel.
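To make the "defense in depth" layers concrete: on Linux you can see from procfs whether a process has a seccomp filter and what its UID mapping looks like. A rough sketch, assuming the documented `Seccomp:` field in /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter) and the three-column /proc/<pid>/uid_map format; the function names are mine, not from any of the projects discussed.

```python
# Hypothetical helpers: they parse procfs text, they do not measure isolation.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Return the seccomp mode named in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" (newer kernels) does not match this prefix.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown({value})")
    return "unavailable"


def has_identity_uid_map(uid_map_text: str) -> bool:
    """True if uid_map is a single full-range identity mapping.

    That is what the initial user namespace shows. It is only a heuristic:
    a namespace created with the same full identity map looks identical.
    """
    rows = [line.split() for line in uid_map_text.splitlines() if line.strip()]
    return rows == [["0", "0", "4294967295"]]
```

Feed these the contents of /proc/self/status and /proc/self/uid_map from inside a container to see what it was started with. Neither answer says anything about the shared-kernel attack surface, which is the point above: a remapped UID range and a seccomp filter still leave the workload talking to the host kernel.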

> Also, note that Sysbox is not meant to isolate workloads in multi-tenant environments (for that we think VM-based approaches are better)

I've read numerous claims that Sysbox is suitable for untrusted workloads, for instance in [1] and [2].

It's a nice product and certainly much, much better than running docker-in-docker using privileged containers, but given the significant remaining attack surface, this claim could put your customers at risk and should come with a big disclaimer.

> While QEMU may be faster than Firecracker in some metrics in a one-instance comparison, when you start running dozens of instances per host, the overhead of the full VM (particularly memory overhead) hurts its performance (which is the reason Firecracker was designed)

Firecracker was designed for memory efficiency, faster cold start times and security (by virtue of being written in a memory-safe language). It means you can run more containers per host, but the actual workload performance overhead is identical to "normal" VMs and, in some cases, even slightly higher since Firecracker lacks some of the optimizations that have gone into QEMU.

> Regarding gVisor performance, we didn't do a full performance comparison vs. KubeVirt, so we may stand corrected if gVisor is in fact slower than KubeVirt when running multiple instances on the same host (would appreciate any more info you may have on such a comparison, we could not find one).

KubeVirt is just plain QEMU VMs using libvirt, and QEMU VMs have been compared to gVisor quite extensively[3][4]. gVisor adds almost no overhead for memory/CPU and quite a lot of overhead for syscalls (but with big improvements recently with the introduction of VFS2 and soon LisaFS[5]). It's a classic trade-off - gVisor is more secure and efficient than QEMU, allowing a much larger number of instances to run on a host by virtue of better cooperation with the host kernel scheduler and memory management, but for raw performance, a QEMU VM always wins.

> Regarding the claim that standard containers cannot run a full OS, what the table in the GH repo is indicating is that Sysbox allows you to create unprivileged containers (or pods) that can run system software such as Docker, Kubernetes, k3s, etc. with good isolation and seamlessly (no privileged container, no changes in the software inside the container, and no tricky container entrypoints). To the best of our knowledge, it's not possible to run say Kubernetes inside a regular container unless it's a privileged container with a custom entrypoint. Or inside a Firecracker VM. If you know otherwise, please let us know.

Firecracker runs a full Linux kernel inside the VM, so it could always run regular Docker, Kubernetes or anything else. See [6] for a practical example.

For containers, needing a privileged container used to be the case, but the situation has improved in recent kernel releases.

For podman, almost every combination works: running systemd unprivileged, running podman inside podman, or even running rootless-podman-in-rootless-podman[7]. So does Kubernetes-in-rootless-{podman,docker}[8], though it requires very recent kernel features, notably cgroups v2 and unprivileged overlayfs.
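If you want to check a host for those prerequisites before trying the rootless setups, something like the following works as a first pass. It's a sketch under assumptions: the presence of /sys/fs/cgroup/cgroup.controllers is the usual sign of a cgroups v2 unified hierarchy, /proc/sys/user/max_user_namespaces holds the user namespace limit, and 5.11 is my understanding of the first mainline kernel that allows overlayfs mounts inside user namespaces (distros may backport it, so treat the version check as a hint). The function names and the `root` parameter are made up for illustration.

```python
import re
from pathlib import Path


def has_cgroup_v2(root: Path = Path("/")) -> bool:
    """cgroup.controllers at the cgroup mount root implies the v2 unified hierarchy."""
    return (root / "sys/fs/cgroup/cgroup.controllers").is_file()


def userns_limit(root: Path = Path("/")) -> int:
    """Per-namespace limit on user namespaces; 0 means none can be created."""
    path = root / "proc/sys/user/max_user_namespaces"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare the leading 'X.Y' of a kernel release string, e.g. from platform.release()."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)
```

On a real host you would call `has_cgroup_v2()`, `userns_limit() > 0` and `kernel_at_least(platform.release(), 5, 11)`. The `root` argument only exists so the checks can be pointed at a fake directory tree.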

Running docker:dind-rootless inside unprivileged Docker containers also works; however, it requires "--security-opt seccomp=unconfined".

Sysbox definitely got to that point earlier and has better usability.

> - Regarding "The claim that their solution offers large security improvements over any other solution with user namespaces isn't true". Where do you see that claim? The table explicitly states that there are solutions that provide stronger isolation.

Apologies, then, for misinterpreting that.

[1]: https://blog.nestybox.com/2020/10/06/related-tech-comparison...

[2]: https://github.com/nestybox/sysbox/issues/120#issuecomment-9...

[3]: https://object-storage-ca-ymq-1.vexxhost.net/swift/v1/6e4619...

[4]: https://www.scitepress.org/Papers/2021/104405/104405.pdf

[5]: https://gvisor.dev/blog/2021/12/02/running-gvisor-in-product...

[6]: https://github.com/innobead/kubefire

[7]: https://www.redhat.com/sysadmin/podman-inside-container

[8]: https://kind.sigs.k8s.io/docs/user/rootless