by nyrikki 60 days ago
I can't comment directly on LXC but LXC is very different from runc/crun/your-CRI here, not better or worse, just different.

With podman, unfortunately, we don't have the k8s Container Storage Interface (CSI), so you have to work with what you have.

When I said:

> it is often much safer to use mount NFS internally

What is more correct is to have the container runtime or container manager mount the shares, not the user inside the container.

But as you are trying to run unprivileged or at least with minimal privileges, which is all we can do with namespaces, you are cutting across the grain.

I do use podman pods and containers, mostly for ease of development, but on more traditional long-lived hosts.

I have a very real need to separate UIDs between co-hosted products, but don't need to actually run a VM for these specific use cases.

So I have particular rootful tasks that have to be done as the root user in Ansible:

1) Install OS packages
2) Create service admin and daemon users
3) Assign subuid/subgid ranges to those user security domains as needed
4) For specific services, add NFS data directories to /etc/fstab with the 'user' and 'noauto' flags
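
As a rough illustration of steps 3 and 4, here is a minimal shell sketch that only prints the lines root would add to /etc/subuid (or /etc/subgid) and /etc/fstab. The function names, the `svcadmin` user and the ID range are hypothetical; the share and mount point are the ones that appear in the podman example and error message in this comment.

```shell
# Hypothetical helpers: they print entries, they do not edit any system file.

# /etc/subuid and /etc/subgid entries have the form name:first_id:count
subid_entry() {
    printf '%s:%s:%s\n' "$1" "$2" "$3"
}

# fstab entry for an NFS share: 'user' lets a non-root user mount it,
# 'noauto' keeps it from being mounted at boot.
nfs_fstab_entry() {
    printf '%s %s nfs user,noauto 0 0\n' "$1" "$2"
}

# Example values (svcadmin and the 200000 range are placeholders)
subid_entry svcadmin 200000 65536
nfs_fstab_entry 192.168.1.84:/path/to/share \
    /home/user/.local/share/containers/storage/volumes/nfs-shared/_data
```

In practice these lines would be written by whatever Ansible tasks run as root; the sketch is only meant to show the shape of the entries.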

In Podman I would then create a named volume and run a container that uses it:

     podman volume create --driver local --opt type=nfs --opt device=192.168.1.84:/path/to/share --opt o=addr=192.168.1.84,.... nfs-shared

     podman run -d --name nfs_test -v nfs-shared:/opt docker.io/library/debian:latest

Which, if you don't have the fstab entry, will give you:

     Error: mounting volume nfs-shared for container ...: mount.nfs: Operation not permitted for 192.168.1.84:/path/to/share on /home/user/.local/share/containers/storage/volumes/nfs-shared/_data

That `_data` directory is one of the hints at the risk of host bind mounts: you either expose an inode that the host cares about, or you get issues across containers, etc.

While imperfect, this follows the named volume pattern, which really just tells the runtime that the volume is used in a container and doesn't expose the mount inode to the container.

What does happen inside the container entry point is validating that the expected UID is reachable, adding a user with the right UID offset and switching to that user.
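
A minimal sketch of what such an entry point check could look like, assuming a POSIX shell plus `useradd` and `setpriv` in the image. The function names, the `svc` user name and the choice of GID equal to UID are all assumptions for illustration, not the actual script.

```shell
#!/bin/sh
# Hypothetical entrypoint helpers.

# Succeed if UID $1 falls inside a range listed in a uid_map-style file
# (columns: first ID inside the namespace, first ID outside, range length).
uid_is_mapped() {
    uid=$1
    map=${2:-/proc/self/uid_map}
    while read -r inside _outside count; do
        if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $((inside + count)) ]; then
            return 0
        fi
    done < "$map"
    return 1
}

# Validate the UID, add a matching user, then drop to it and run the command.
run_as_uid() {
    uid=$1
    shift
    if ! uid_is_mapped "$uid"; then
        echo "UID $uid is not mapped into this namespace" >&2
        return 1
    fi
    useradd --no-create-home --uid "$uid" svc   # "svc" is a placeholder name
    # Assumes the GID equals the UID; adjust if the share uses another group.
    exec setpriv --reuid "$uid" --regid "$uid" --clear-groups "$@"
}
```

An entry point would end with something like `run_as_uid 1000 /usr/bin/my-service`, where both the UID and the binary are placeholders.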

A misconfigured host bind mount, or data leaking because you can't see who has access, are the most common problems. Because containers run with elevated privileges until you drop them, they can get around those protections: even if they aren't elevating to root in a rootless situation, they can still access the data of any running container with just a few trivial mistakes or newly discovered vulnerabilities.

    $ capsh --decode=00000000800405fb
    0x00000000800405fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
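
If capsh isn't installed, the mask can be picked apart by hand: each set bit is a capability number from linux/capability.h (bit 0 is cap_chown, bit 31 is cap_setfcap). A small sketch follows; `cap_bits` is a made-up helper name, and it expects the mask with a `0x` prefix.

```shell
# List the bit positions that are set in a hex capability mask.
cap_bits() {
    mask=$(( $1 ))
    bit=0
    out=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="${out:+$out }$bit"   # append, space-separated
        fi
        mask=$(( mask >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "$out"
}

cap_bits 0x800405fb   # prints: 0 1 3 4 5 6 7 8 10 18 31
```

Those eleven bit numbers line up with the eleven capability names in the capsh output above.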

While NFS is absolutely a whole new ball of wax with other issues, one nice thing is that the servers (at least the ones I know of) don't even support the concept of user namespaces and UID mapping. That makes it fragile and dangerous if you start mapping UIDs/GIDs in, but it can be an advantage if you can simply isolate UID/GID ranges.

IMHO it will be horses for courses and depend on your risk appetite, as every option is a least-worst one and there simply will be no best option, especially with OCI.

1 comments

Wow, I really appreciate you coming back for the follow-up! It's too late for me to read through it in detail at this moment but:

In the end I discovered that I can combine a "mapall/squash" on the NFS server, a regular NFS mount on Proxmox, and then an `lxc.mount.entry` in the LXC config. The combined effect is an unprivileged container with read-write permissions for the UID/GID specified on the NFS server. If I need more UID/GID combinations I can just create bind mounts and then export those with the appropriate mapall/squash settings.

Thanks again :)