It's a nice blog post, but it still misses a few important building blocks without which it would be trivial to escape a container running as root. Apart from chroot, cgroups and namespaces, containers are also built upon:

1) Linux capabilities - these split the privileges of the root user into "capabilities", which allows limiting the actions a root user can perform (see `man 7 capabilities`, `cat /proc/self/status | grep Cap` or `capsh --decode=a80425fb`)

2) seccomp - used to filter the syscalls, and their arguments, that a process can execute (fwiw Docker renders its seccomp policy based on the capabilities requested by the container)

3) AppArmor (or SELinux, though AppArmor is the default) - an LSM (Linux Security Module) used to limit access to certain paths on the system and syscalls

4) masked paths - container engines bind mount over certain sensitive paths so they can't be read or written to (like /proc/sysrq-trigger, /proc/irq, /proc/kcore etc.)

5) NoNewPrivs flag - while not enabled by default (e.g., in Docker), this prevents the user from gaining more privileges (e.g., suid binaries won't change the uid)

If anyone is interested in reading more about these topics and the security of containers, you may want to read a blog post [0] where I dissected a privileged Docker escape technique (note: with --privileged, you could just mount the disk device and read/write to it) and the slides from a talk [1] I have given, which details the Docker container building blocks and shows how we can investigate them.

[0] https://blog.trailofbits.com/2019/07/19/understanding-docker...

[1] https://docs.google.com/presentation/d/1tCqmGSOJJzi6ZK7TNhbz...
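For anyone who wants to poke at items 1, 2 and 5 from inside a container, here is a minimal Python sketch of roughly what `capsh --decode` and `grep Cap /proc/self/status` give you. It is only an illustration: the function names (`decode_caps`, `parse_status`, `summarize`) are made up for this example, and the capability table is copied by hand from `linux/capability.h`, so newer kernels may define bits it does not know about.

```python
import os

# Capability names indexed by bit number, hand-copied from linux/capability.h.
# Assumption: the table ends at CAP_CHECKPOINT_RESTORE (bit 40).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG",
    "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON",
    "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
]

# Values of the "Seccomp:" field as documented in proc(5).
SECCOMP_MODES = {"0": "disabled", "1": "strict", "2": "filter"}


def decode_caps(mask):
    """Turn a capability bitmask into a list of capability names."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Fall back to a numeric label for bits the table does not cover.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else "CAP_%d" % bit)
        bit += 1
    return names


def parse_status(text):
    """Parse the 'Key:\\tvalue' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(status_text):
    """Extract effective capabilities, NoNewPrivs and seccomp mode."""
    fields = parse_status(status_text)
    return {
        "effective_caps": decode_caps(int(fields.get("CapEff", "0"), 16)),
        "no_new_privs": fields.get("NoNewPrivs") == "1",
        "seccomp_mode": SECCOMP_MODES.get(fields.get("Seccomp", ""), "unknown"),
    }


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as f:
        print(summarize(f.read()))
```

Running it on the host and then inside a container should show the difference in the capability set; calling `decode_caps(0xa80425fb)` decodes the same mask as the `capsh` example above.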
But once I went through that mental exercise, I started reading code in containerd and cri-o. Wow, these are _not_ simple projects; containerd itself has a full gRPC-based service registry for driving dynamic logic via config.
One thing I was pretty disappointed about is how deeply ingrained OCI images are in the whole ecosystem. You can replace almost all functional parts of the runtime, but not really the concept of images. I think images are a poor solution to the problem they solve, and a big downside of this is a bunch of complexity in the runtimes trying to work around how images work (like remote snapshotters).
[0] https://github.com/pdtpartners/nix-snapshotter