| There's no need to lecture me. I am very familiar with cgroups, having contributed to their implementation, and I also maintain runc, which is a container runtime (that obviously uses cgroups quite heavily). I've also discussed these issues with other container runtime developers such as the LXC folks and kernel developers. So let's talk about the API. First of all, cgroupv2 requires a single hierarchy. This means that if systemd is using cgroups for managing services, you cannot use cgroups for anything else because systemd will get confused if you create any new hierarchies. You may argue this is a bug in systemd, but I would argue it's because you can't have named cgroup hierarchies in v2 (like you could in v1, which is what systemd uses on v1). But ignoring that "slight" issue, how about we talk about the no-internal-process constraint and how subtree control works. First of all, in order to use a cgroup controller, all of your ancestor cgroups must have that controller enabled. So if systemd decides not to use a controller, then you can't use it either (without messing with things that systemd thinks it owns). But ignoring that, let's say you want to create a new cgroup inside your user session (we've already established systemd won't like that, but let's assume that systemd plays along). You can't just create a new subcgroup (you won't be able to use the controllers); you have to create two, then move all of the other processes into one and the process you wanted to control into the other. While this may sound okay, you have to realise that as a container runtime you now have to mess with processes that you have no control over and no idea what they do. Not to mention that there's no way to atomically move all processes into a cgroup (so there'll be race conditions in trying to set this up).
The "delegation" model of cgroupv2 is effectively based around the systemd delegation model, where the higher level has to semantically grant you the right to manage your own resources. What kind of resource management system requires you to request the right to manage your own resources? prlimit(2) doesn't do that. cgroupv1 somewhat had this issue as well, but cgroupv2 adds another limitation: even if you have write access to a child cgroup, you still need write access in your current cgroup in order to move a process into the child. Write access to cgroup.procs is actually a privilege in cgroups, so giving users access to this won't always be desirable, but it also further bakes in the management process design. I've talked to Tejun on the mailing lists, and it's very clear that he prioritises the model of having a higher level process managing cgroups. In discussions about unprivileged subtree delegation (something that is necessary for rootless containers to use cgroups) he made it clear that he isn't interested in the feature because it would cause issues for systemd, which manages all cgroups on a system. There's actually even more stuff you have to do to manage cgroups if you're not systemd, by the way. I've talked to some LXC folks and we collated a list of 12 different cases and things you need to deal with in order to use cgroupv2 effectively (and all of them break rootless containers, as well as making container runtimes very "noisy neighbours" as a result). cgroupv1 (despite its downsides) had none of these issues. The only current user of cgroupv2 is systemd, and they've had several instances where they broke every container runtime because they flipped the cgroupv2 switch early. Yes, this was a rant, but I'm really tired of people defending this. cgroupv2 did make some good decisions, but then followed up by making some truly awful ones. |
A control group on the machine in front of me tells me that you are wrong about two more things.
Unprivileged subtree delegation exists, that being a control group delegated to my account which has a whole subtree of further control groups in it, managed by multiple unprivileged processes. Your problem with "rootless" containers is not because of the non-existence, because Tejun Heo "isn't interested", of something that visibly exists. That's clearly not a correct description of the situation at all. Furthermore, https://lkml.org/lkml/2017/6/25/4 and https://lkml.org/lkml/2017/6/25/6 tell me that far from "isn't interested", Tejun Heo is interested in subtree delegation to unprivileged users. After all, xe is fiddling with it right now.
systemd is not the sole user of version 2 control groups.
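For anyone following along who hasn't done it by hand, the two-child setup described in the quoted comment looks roughly like the sketch below. It only plans the writes and applies them separately; the function names, the `rest`/`target` child names and the example path are made up for illustration. The interface files `cgroup.procs` and `cgroup.subtree_control` are the real version 2 ones, and actually applying the steps assumes you have write access to the parent control group.

```python
from pathlib import Path


def plan_leaf_split(parent, pids, target_pid, controller,
                    rest_name="rest", target_name="target"):
    """Return the ordered (action, path, value) steps for the split.

    Hypothetical helper: the child names are arbitrary choices.
    """
    parent = Path(parent)
    rest = parent / rest_name
    target = parent / target_name
    steps = [("mkdir", rest, None), ("mkdir", target, None)]
    # Each process is migrated with its own write to cgroup.procs,
    # so the whole move is a sequence of steps, not one atomic operation.
    for pid in pids:
        dest = target if pid == target_pid else rest
        steps.append(("write", dest / "cgroup.procs", str(pid)))
    # The parent may only hand a controller down to its children once
    # it no longer holds processes of its own.
    steps.append(
        ("write", parent / "cgroup.subtree_control", f"+{controller}")
    )
    return steps


def apply_steps(steps):
    """Carry out a plan against a mounted cgroup2 filesystem."""
    for action, path, value in steps:
        if action == "mkdir":
            path.mkdir(exist_ok=True)
        else:
            path.write_text(value)


if __name__ == "__main__":
    # Example path is an assumption; substitute your own delegated group.
    for step in plan_leaf_split("/sys/fs/cgroup/demo", [10, 11, 12], 11, "memory"):
        print(step)
```

Any process that forks between reading the parent's process list and the final write will still be sitting in the parent, which is the race the quoted comment is complaining about.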