by nyrikki 31 days ago
Namespaces and cgroups allow for resource accounting and limited isolation between trusted processes. It is only through hard work and luck that they have been usable in the k8s/docker world.

To be 100% clear, namespaces are not a security feature in themselves. They can be used to run processes with reduced privileges and improved isolation, but not to contain untrusted code.

A few reasons.

1) Kernel features have to support namespaces explicitly, and only the portions that do so gain the increased isolation. Any syscall, socket family, etc. can still provide an attack vector against the global kernel.

2) The methods to further constrain processes, like LSMs, seccomp, and eBPF-based syscall filtering, are typically not used by common container images and are difficult for users to develop and deploy.

3) User namespaces have actually increased exposure of user data, even if they help protect the system itself, because of the proliferation of capabilities(7)[0]. Capabilities were designed as vertical slices of superuser (root) functionality, and the contract is much different than people expect[1][2].

We will have to see where things go, but as far as untrusted code goes: no, containers/namespaces/etc. are not sufficient at all. There are just too many holes in the shared kernel, and several socket()-based backends that are reached through netlink etc.

Here you can see just how insane the number of capabilities is in the default bounding set (CapBnd) of every user process right now:

     $ grep ^CapBnd /proc/$$/status
     CapBnd: 000001ffffffffff
     $ capsh --decode=000001ffffffffff
     0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
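
If capsh is not installed, the same decode can be approximated in a few lines of Python. This is only a sketch, not how capsh does it: `CAP_NAMES` and `decode_caps` are names made up for this example, and the bit order is simply copied from the capsh output above (bit 0 is cap_chown). Newer kernels may define bits that this table does not know about.

```python
# Capability names in bit order (bit 0 = cap_chown), copied from the
# capsh output quoted above. This table is an assumption of the sketch;
# a newer kernel may define additional bits.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask_hex):
    """Return the capability names whose bits are set in a hex mask string."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits beyond the table get a numeric placeholder label.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
        bit += 1
    return names


if __name__ == "__main__":
    # Read the bounding set of the current process, like the grep above.
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("CapBnd:"):
                caps = decode_caps(line.split(":", 1)[1].strip())
                print(len(caps), ",".join(caps))
```

Run it inside and outside a sandbox to compare bounding sets side by side.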

[0] https://man7.org/linux/man-pages/man7/capabilities.7.html
[1] https://elixir.bootlin.com/linux/v7.0.1/source/kernel/capabi...
[2] https://www.kernel.org/doc/html/latest/admin-guide/namespace...

So then how does bubblewrap/firejail do it through namespaces?
Bubblewrap tends to have better defaults than docker/rancher/podman, where users rarely set `USER` or drop elevated privileges, but it still has the same limitations.

It is just the reality that namespaces/seccomp/eBPF/cgroups are privilege-dropping mechanisms and are not jails.

But it is better with common command line options like:

     $ bwrap --ro-bind /usr /usr --ro-bind /bin /bin --ro-bind /lib /lib --ro-bind /lib64 /lib64 --ro-bind /sbin /sbin --ro-bind /etc /etc --proc /proc --dev /dev --tmpfs /tmp /usr/bin/bash
     $ grep ^Cap /proc/$$/status
     CapInh: 0000000000000000
     CapPrm: 0000000000000000
     CapEff: 0000000000000000
     CapBnd: 0000000000000000
     CapAmb: 0000000000000000
     $ grep ^NoNew /proc/$$/status
     NoNewPrivs: 1

But yes, it is using the same clone/unshare/capabilities machinery that containers use.

But at least they tend to default to running without elevated privileges.
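
The same check can be scripted if you want to compare several sandboxes. A rough sketch, assuming the `/proc/<pid>/status` field names shown in the output above; `parse_cap_status` and `looks_unprivileged` are hypothetical helper names, and passing this check says nothing about what the shared kernel still exposes.

```python
CAP_KEYS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def parse_cap_status(status_text):
    """Extract the Cap* masks and NoNewPrivs from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        if key in CAP_KEYS:
            fields[key] = int(value.strip(), 16)  # masks are hexadecimal
        elif key == "NoNewPrivs":
            fields[key] = int(value.strip())
    return fields


def looks_unprivileged(fields):
    """True when all five capability sets are empty and no_new_privs is 1."""
    caps_empty = all(fields.get(key) == 0 for key in CAP_KEYS)
    return caps_empty and fields.get("NoNewPrivs") == 1


if __name__ == "__main__":
    with open("/proc/self/status") as status:
        fields = parse_cap_status(status.read())
    for key, value in fields.items():
        print(f"{key}: {value:#x}")
    print("no capabilities and no_new_privs set:", looks_unprivileged(fields))
```

Inside the bwrap shell above this should report empty sets; in a typical root-in-container setup it usually will not.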

Note from the bwrap repo[0]:

    Whatever program constructs the command-line arguments for bubblewrap (often a larger framework like Flatpak, libgnome-desktop, sandwine or an ad-hoc script) is responsible for defining its own security model, and choosing appropriate bubblewrap command-line arguments to implement that security model.

Or warnings from distros like Arch[1]:

     Warning
     Bubblewrap is a tool which provides sandboxing technologies like namespaces and seccomp filter. It does not by default provide a full sandbox that isolates weakpoints of a used technology. Running untrusted code is never safe, sandboxing cannot change this.

But spin up a bwrap instance like the above and note how, just using Python's socket.socket(), you can get the kernel to autoload pretty much every module listed by:

     grep net-pf /lib/modules/`uname -r`/modules.alias

that is not blacklisted in:

     /etc/modprobe.d/blacklist-rare-network.conf

That is probably the easiest way to see that you still have the issues with the shared kernel.
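
A rough way to try this from inside the sandbox is below. It is a sketch under assumptions: `probe_family`, `loaded_modules`, and the `CANDIDATE_FAMILIES` list are made up for this example, the families are just a few that Python's socket module may expose, and whether anything actually autoloads depends on the kernel config, the modprobe blacklist, and any LSM policy. An error from socket() does not by itself show that no module load was attempted.

```python
import errno
import socket

# A few address family names to try. Which ones exist depends on the
# Python build and platform, so they are looked up with getattr below.
CANDIDATE_FAMILIES = [
    "AF_INET", "AF_UNIX", "AF_NETLINK", "AF_PACKET",
    "AF_CAN", "AF_RDS", "AF_TIPC", "AF_ALG", "AF_VSOCK",
]


def probe_family(family, sock_type=socket.SOCK_DGRAM):
    """Try to create a socket; return "ok" or the errno name on failure."""
    try:
        sock = socket.socket(family, sock_type)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    sock.close()
    return "ok"


def loaded_modules():
    """Return the set of module names in /proc/modules (empty if unreadable)."""
    try:
        with open("/proc/modules") as modules:
            return {line.split()[0] for line in modules if line.strip()}
    except OSError:
        return set()


if __name__ == "__main__":
    before = loaded_modules()
    for name in CANDIDATE_FAMILIES:
        family = getattr(socket, name, None)
        if family is None:
            print(f"{name}: not defined in this Python build")
            continue
        print(f"{name}: {probe_family(family)}")
    # Anything listed here appeared while the probes ran; on a busy
    # system an unrelated load could show up too.
    print("newly loaded modules:", sorted(loaded_modules() - before))
```

Compare the "newly loaded" line on the host and inside the bwrap shell; an unprivileged process with zero capabilities is enough to trigger it.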

Note that an LSM like AppArmor may add constraints on AppArmor systems; look at [2].

The bwrap team is better at working with the LSM teams, while OCI actively refuses to give guidance and has a dangerous profile[3].

Once again, namespaces/cgroups/seccomp/avoiding elevated privileges/etc. are all important for running with minimal privileges, but yes, 'sandboxes' and 'containers' provide much less isolation than most people realize.

[0] https://github.com/containers/bubblewrap#limitations
[1] https://wiki.archlinux.org/title/Bubblewrap
[2] https://gitlab.com/apparmor/apparmor/-/blob/master/profiles/...
[3] https://gitlab.com/apparmor/apparmor/-/blob/master/profiles/...