by jsiepkes 1868 days ago
In my opinion Linux hasn't caught up.

* Namespaces don't come close to FreeBSD jails or Solaris / Illumos Zones. There is a reason Docker hosters put their Docker tenants in different hardware VMs: the isolation is too weak.

* Due to CDDL and GPL problems ZFS on Linux will always be hard to use making every update cycle like playing Russian roulette.

And there are other benefits. Like SMF offers nice service management while not providing half an operating system like systemd.


The problem with this jails/zones stuff is that I don't know anyone who seriously trusts jails and zones for real multitenant workloads anyways. The dealbreaker problem remains a shared kernel attack surface between tenants. It's one thing to propose that Zones are better than namespaces (they probably are), but another thing to cross the threshold where the distinction is meaningful in practice.
At Joyent, we deployed public-facing multitenant workloads based on zones (and before that, jails) for many years. We seriously trusted it -- and had serious customers who seriously depended on it. So, now you know someone!
To be fair, y'all had some serious vulnerabilities, including zone escapes and arbitrary kernel memory reads, discovered by @benmmurphy.
Yes, though I would like to believe that Ben's responsible disclosure coupled with our addressing those vulns (and auditing ourselves for similar) reflect exactly that seriousness around multitenant security. And for whatever it's worth, one of those vulnerabilities -- which was a bug in my code! -- very much informed my own thinking about the inherent unsafety of C, underscoring the appeal of Rust. So I am grateful in several dimensions!
If you have a kernel implemented in Rust, (1) you should shout that from the rooftops and (2) use whatever isolation mechanism you like on it.
They're starting with the bootloader and management engine. That's a tough enough ocean to boil.

Give them some time to get Rust above that.

To this, all I can say is that I spent from 2005-2014, and then from 2016-2020, doing nothing but security evaluations of products, probably about 60% of which were serverside multitenant SAAS systems of one form or another, and I don't remember ever evaluating (or overseeing the evaluation of) a system that relied on Jails or Zones. Lots of Docker! And, until a few years ago, multitenant Docker isolation was an infamous joke! I'm not sticking up for it!

You can look at the recent history of Linux kernel LPEs --- there has been sort of a renaissance because of mobile devices --- and count all the ways any shared-kernel multitenant system would have broken down. At the end of the day, it's not so much about predicting whether your system can get owned up (it can), so much as: "what do I need to do when there is a kernel LPE announced on my platform". If you're doing shared-kernel isolation, the right answer to that question is usually "fire drill". It's not a noodley thought-leadership kind of question; it's a simple, practical concern.

There were also tons of providers who trusted Linux containers for VPS hosting.
How'd that turn out?
I haven't heard any stories of people being hacked via container escape, but the whole VPS industry was so low-stakes that maybe customers didn't expect good isolation anyway.
And needless to say it became a billion dollar business, with a great product.
They were acquired for $170m.
I stand corrected. Still great product, business and team.
I'm sure they're great. No part of what I have to say about this has anything to do with how competent they are.
Security requirements (and awareness) have increased over the years, have they not?
They definitely have! And we had a (zones-based) public cloud through it all. On that note, Alex Wilson's description of working with Robert Mustacchi on mitigating Meltdown by adding KPTI to illumos[0] definitely merits a read!

[0] https://blog.cooperi.net/a-long-two-months

Also, tools for improving Docker for multi-tenant workloads exist, like gVisor. I don’t think equivalents exist for jails/zones really.
gVisor isn't a shared-kernel multitenant system; it's essentially kernel emulation. It's a much stronger design.
I mostly mean that it is intended to be a solution to run containers from multiple tenants on the same host. Though I do agree, being essentially a kernel in itself, it is a bit in a different wheelhouse. It still is a huge value add that you can implement something like that on top of Docker, imo.
You can run container workloads in "real" VMs too; for instance, check out Kata Containers. Containers are a way of packaging applications; confusingly, they happen to also have a reference standard runtime associated with them. But you don't have to use it.
Of course, and that's the value of the abstraction to me. Docker itself obviously has nothing to do with the Linux container technologies themselves that make up the equivalent functionality of FreeBSD jails, but I'm not aware of any equivalent abstraction that works around jails or zones even though it might be possible. So the way I see containers on Linux is not literally a composition of kernel features like cgroups or seccomp, but as an abstract thing that can be composed out of various primitives. And in practice, there are a number of different runtimes around it, including Docker clones like Podman, or tools that manage effectively chroots much closer to what you would do with jails.

That said, I could just be completely wrong, and there could be similar things that can be done using jails and zones. But when I looked around for similar art with FreeBSD jails, either with regards to Docker's style of packaging and distribution, or with regards to additional layers like gVisor, it didn't seem like a thing well-suited to that kind of composition. In comparison, jails, at least, seem kind of like more powerful chroots. To me this is a pretty big difference versus Linux "containers".

Similar to Xen?
Much weirder than Xen.
> The dealbreaker problem remains a shared kernel attack surface between tenants.

Also, now, extremely subtle and hard-to-mitigate timing attacks between tenants.

In fairness, that's an attack class that's very difficult to eradicate even with virtualization.
> In my opinion Linux hasn't caught up.

I completely agree. I love Linux and it’s easily my preferred desktop OS but when it comes to stuff like ZFS, containerisation and other enterprise features, FreeBSD and Solaris are just more unified and consistent. A lot of it has to do with Linux being a hodgepodge of different contributors resulting in every feature effectively being a 3rd party feature. Which I think is the problem Poettering was trying to solve. And in many ways that’s quite a strength too. But ultimately it boils down to the old Perl mantra of “There's more than one way to do it” and how it’s fun for hackers, but FreeBSD et al add the “but sometimes consistency is not a bad thing either” part of the mantra too.

https://en.m.wikipedia.org/wiki/There%27s_more_than_one_way_...

> Namespaces don't come close to FreeBSD jails or Solaris / Illumos Zones. There is a reason Docker hosters put their Docker tenants in different hardware VMs: the isolation is too weak.

This is largely a myth, please provide a namespace-related CVE that has gone unpatched to support your argument. The reason they run as VMs is that hypervisors run on ring 0 and require higher privileges than the kernel, therefore they are naturally more secure. Like Namespaces, Zones and Jails are also managed by their respective kernels. If there were any major hosters running managed services for Zones and Jails, you can bet they would implement them in a similar way.

> Due to CDDL and GPL problems ZFS on Linux will always be hard to use making every update cycle like playing Russian roulette.

You're right in that the CDDL causes complication but I don't consider this to be a compelling reason to use Illumos. Many who want to use ZFS on Linux will use it and get it to work despite the licensing issues and complications.

> Like SMF offers nice service management while not providing half an operating system like systemd.

SMF is relatively nice (apart from the use of XML) and like you, I would not touch systemd with a barge pole. Despite systemd making a lot of noise in major distros, there are plenty of alternative distros for those of us who don't want to use it.

Don't get me wrong, I'm a Solaris guy, it made my career. I just fear that by dropping SPARC, Illumos have put the final nail in their own coffin.

> I just fear that by dropping SPARC, Illumos have put the final nail in their own coffin.

If that were true, most illumos users today would be SPARC users. There would be more than a couple of people working on SPARC support, and not merely as a part-time hobby. There would be software support for a SPARC machine that was sold some time after 2011.

Instead, something like 99% of the people running illumos are doing so on 64-bit x86 machines. Dropping SPARC support will allow us to move forward much more easily with enhancements to the dramatically more relevant x86 bits. If anything I expect it will allow us to do interesting things that would garner new interest, like using Rust to implement bits of the operating system.

Thanks for the reply. That surprises me but I'm glad to hear that x86 support is strong. It does lead me to wonder, what sort of things are people using Illumos for? Maybe it's time I checked it out. :)
> This is largely a myth, please provide a namespace-related CVE that has gone unpatched to support your argument.

What I mean is that if you use LXC namespaces as a container it is going to be an insecure container. Simply because LXC namespaces are not containers and are not going to provide a fully isolated environment. Namespaces are low-level building blocks which, together with other technologies (for example a virtualized network stack), you can use to make fully isolated containers. And that's why most hosters just took a shortcut and put the whole thing in a hardware VM to ensure tenants are fully isolated. Which I think is a shame since you also get all the overhead of a hardware VM.

So sure, you can _make_ something like jails or zones on Linux if you combine a bunch of things and provide the glue. But there is no concept of a container like jails or zones in Linux. Which leads to other problems such as there not being any tooling to manage the (non-existent) container.
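
To make the "building blocks" point concrete, here is a rough Python sketch (illustrative only, not anything from LXC or Docker). It assumes a Linux host where `/proc/self/ns` exists and lists each namespace the current process belongs to. The helper names `parse_ns_link` and `current_namespaces` are made up for this example. Each entry is a separate kernel object, and nothing in the list represents "the container" as a whole.

```python
import os
import re

NS_DIR = "/proc/self/ns"  # Linux-only; this directory does not exist on FreeBSD or illumos


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into ('pid', 4026531836)."""
    match = re.fullmatch(r"([a-z_]+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def current_namespaces(ns_dir=NS_DIR):
    """Return {entry name: namespace id} for this process, or {} if ns_dir is missing."""
    if not os.path.isdir(ns_dir):
        return {}
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        try:
            target = os.readlink(os.path.join(ns_dir, entry))
            _, ns_id = parse_ns_link(target)
        except (OSError, ValueError):
            continue  # skip entries that cannot be read or parsed
        result[entry] = ns_id
    return result


if __name__ == "__main__":
    # One line per namespace; there is no single identifier tying them together.
    for name, ns_id in current_namespaces().items():
        print(f"{name:20} {ns_id}")
```

Two processes are "in the same container" only in the sense that a runtime arranged for all of these ids to match, which is the glue being described above.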

> The reason they run as VMs is that hypervisors run on ring 0 and require higher privileges than the kernel, therefore they are naturally more secure.

I don't know if I fully understand what you're saying here, but I think you mean that with a type 2 hypervisor the hypervisor's kernel runs in a more privileged mode on the CPU than the virtualized kernels it manages, and that this provides additional security?

I don't really see how a type 2 hypervisor would conceptually give additional security compared to a type 1 hypervisor (where a single kernel can provide multiple OS instances, such as FreeBSD with Jails or Solaris / Illumos with Zones). Everything that is not the "main" kernel always executes in a less privileged mode than the kernel executing it on the CPU. For example, no user process executes in Ring 0 on a "normal" (i.e. non-hypervisor) OS. With containers this is no different. Hardware virtualization doesn't give a big conceptual advantage in that regard as far as I know.