Hacker News new | ask | show | jobs
by compsciphd 699 days ago
As the person who created docker (well, before docker - see https://www.usenix.org/legacy/events/atc10/tech/full_papers/... and compare to docker), I argued that it wasn't just good for containers, but could be used to improve VM management as well (i.e. a single VM per running image - seehttps://www.usenix.org/legacy/events/lisa11/tech/full_papers...)

I then went onto built a system with kubernetes that enabled one to run "kubernetes pods" in independent VMs - https://github.com/apporbit/infranetes (as well as create hybrid "legacy" VM / "modern" container deployments all managed via kubernetes.)

- as a total aside (while I toot my own hort on the topic of papers I wrote or contributed to), note the reviewer of this paper that originally used the term Pod for a running container - https://www.usenix.org/legacy/events/osdi02/tech/full_papers... - explains where Kubernetes got the term from.

I'd argue that FreeBSD Jails / Solaris Zones (Solaris Zone/ZFS inspired my original work) really aren't any more secure than containers on linux, as they all suffer from the same fundamental problem of the entire kernel being part of one's "tcb", so any security advantage they have is simply due lack of bugs, not simply a better design.

6 comments

> As the person who created docker (well, before docker - see https://www.usenix.org/legacy/events/atc10/tech/full_papers/... and compare to docker)

I picked the name and wrote the first prototype (python2) of Docker in 2012. I had not read your document (dated 2010). I didn't really read English that well at the time, I probably wouldn't have been able to understand it anyways.

https://en.wikipedia.org/wiki/Multiple_discovery

More details for the curious: I wrote the design doc and implemented the prototype. But not in a vacuum. It was a lot work with Andrea, Jérôme and Gabriel. Ultimately, we all liked the name Docker. The prototype already had the notion of layers, lifetime management of containers and other fundamentals. It exposed an API (over TCP with zerorpc). We were working on container orchestration, and we needed a daemon to manage the life cycle of containers on every machine.

I'd note I didn't say you copied it, just that I created it first (i.e. "compare paper to docker". also, as you note, its possible someone else did it too, but at least my conception got through academic peer-review / patent office, yeah, there's a patent, never been attempted to be enforced though to my knowledge).

when I describe my work (I actually should have used quotes here), I generally give air quotes when saying it, or say "proto docker", as it provides context for what I did (there's also a lot of people who view docker as synonymous with containerization as a whole, and I say that containers existed way before me). I generally try to approach it humbly, but I am proud that I predicted and built what the industry seemingly needed (or at least is heavily using).

people have asked me why I didn't pursue it as a company, and my answer is a) I'm not much of an entrepreneur (main answer), and b) I felt it was a feature, not a "product", and would therefore only really profitable for those that had a product that could use it as a feature (which one could argue that product turned out to be clouds, i.e. they are the ones really making money off this feature). or as someone once said a feature isn't necessarily a product and a product isn't necessarily a company.

I understood your point. I wanted to clarify, and in some ways connect with you.

At the time, I didn't know what I was doing. Maybe my colleagues did some more, but I doubt that. I just wanted to stop waking up at night because our crappy container management code was broken again. The most brittle part was the lifecycle of containers (and their filesystem). I recall being very adamant about the layered filesystem, because it allowed to share storage and RAM across running (containerized) processes. This saves in pure storage and RAM usage, but also in CPU time, because the same code (like the libc for example) is cached across all processes. Of course this only works if you have a lot of common layers. But I remember at the time, it made for very noticeable savings. Anyways, fun tidbits.

I wonder how much faster/better it would have been if inspired by your academic research. Or maybe not knowing anything made it so we solved the problems at hand in order. I don't know. I left the company shortly after. They renamed to Docker, and made it what it is today.

I like to say that Docker wouldn’t exist if the Python packaging and dependency management system weren’t complete garbage. You can draw a straight line from “run Python” to dotCloud to Docker.

Does that jive with your experience/memory at all? How much of your motivation for writing Docker could have been avoided if there were a sane way to compile a Python application into a single binary?

It’s funny, this era of dotCloud type IaaS providers kind of disappeared for a while, only to be semi-revived by the likes of Vercel (who, incidentally, moved away from a generic platform for running containers, in favor of focusing on one specific language runtime). But its legacy is containerization. And it’s kind of hard to imagine the world without containers now (for better or worse).

I do not think the mess of dependency management in Python got us to Docker/containers. Rather Docker/containers standardized deploying applications to production. Which brings reproducibility without having to solve dependency management.

Long answer with context follows.

I was focused on container allocation and lifecycle. So my experience, recollection, and understanding of what we were doing is biased with this in mind.

dotCloud was offering a cheaper alternative to virtual machines. We started with pretty much full Linux distributions in the containers. I think some still had a /boot with the unused Linux kernel in there.

I came to the job with some experience testing deploying Linux at scale quickly by preparing images with chroot before making a tarball to then distribute over the network (via multicast from a seed machine) with a quick grub update. This was for quickly installing Counter-Strike servers for tournament in Europe. In those days it was one machine per game server. I was also used to run those tarball as virtual machines for throughout testing. To save storage space on my laptop at the time, I would hard-link together all the common files across my various chroot directories. I would only tarball to ship it out.

It turned out my counter-strike tarballs from 2008 would run fine as containers in 2011.

The main competition was Heroku. They did not use containers at the beginning. And they focused on running one language stack very well. It was Ruby and a database I forget.

At dotCloud we could run anything. And we wanted to be known serving everything. All your languages, not just one. So early on we started offering base images ready made for specific languages and database. It was so much work to support. We had a few base images per team member to maintain, while still trying to develop the platform.

The layered filesystem was to pack ressources more efficiently on our servers. We definitely liked that it saved build time on our laptop when testing (we still had small and slow spinning disks in 2011).

So I wouldn't say that Docker wouldn't exist without the mess of dependency management in software. It just happened to offer a standardized interface between application developers, and the person running it in production (devops/sre).

The fact you could run the container on your local (Linux) machine was great for testing. Then people realized they could work around dependency hell and non reproducible development environment by using containers.

they did it "simpler", i.e. academic work has to be "perfect" in a way a product does not. so (from my perspective), they punted the entire concept of making what I would refer to as a "layer aware linux distribution" and just created layers "on demand" (via RUN syntax of dockerfiles).

From an academic perspective, its "terrible", so much duplicate layers out in the world, from a practical perspective of delivering a product, it makes a lot of sense.

It's also simpler from the fact that I was trying to make it work for both what I call "persistent" containers (ala pets in the terminology) that could be upgraded in place and "ephemeral" containers (ala cattle) when in practice the work to enable upgrading in place (replacing layers on demand) to upgrade "persistent" containers I'm not sure is that useful (its technologically interesting, but that's different than useful).

My argument for this was that this actually improves runtime upgrading of systems. With dpkg/rpm, if you upgrade libc, your systems is actually temporarily in a state where it can't run any applications (in the delta of time when the old libc .so is deleted and the new one is created in its place, or completely overwrites it), any program that attempts to run in that (very) short period time, will fail (due to libc not really existing). By having a mechanism where layers could be swapped in essentially an atomic manner, no delete / overwrite of files occurs and therefore there is zero time when programs won't run.

In practice, the fact that a real world product came out with a very similar design/implementation makes me feel validated (i.e. a lot of phd work is one offs, never to see the light of day after the papers for it are published).

> so (from my perspective), they punted the entire concept of making what I would refer to as a "layer aware linux distribution"

Would you consider there to be any 'layer-aware Linux distributions' today, e.g., NixOS, GuixSD, rpm-ostree-based distros like Fedora CoreOS, or distri?

> so much duplicate layers out in the world

Have you seen this, which lets existing container systems understand a Linux package manager's packages as individual layers?

https://github.com/pdtpartners/nix-snapshotter

(Not GP.)

NixOS can share its Nix store with child (systemd-nspawn) containers. That is, if you go all in, package everything using Nix, and then carefully ensure you don’t have differing (transitive build- or run-time) dependency versions anywhere, those dependencies will be shared to the maximum extent possible. The amount of sharing you actually get matches the effort you put into making your containers use the same dependency versions. No “layers”, but still close what you’re getting at, I think.

On the other hand, Nixpkgs (which NixOS is built on top of) doesn’t really follow a discipline of minimizing package sizes to the extent that, say, Alpine does. You fairly often find documentation and development components living together with the runtime ones, especially for less popular software. (The watchword here is “closure size”, as in the size of a package and all of its transitive runtime dependencies.)

I'll try to get back to this to give a proper response, but can't promise.
I’m really confused. Solomon Hykes is typically credited as the creator of Docker. Who are you? Why is he credited if someone else created it?
> Why is he credited if someone else created it?

This is the internet and just about everyone could be diagnosed with Not Invented Here syndrome. First one to get recognition for creating something that's already been created is just a popular meme.

Solomon was the CEO of dotCloud. Sébastien the CTO. I (François-Xavier "bombela") was one of the first software engineer employee. Along with Andrea, Jérôme and Louis with Sam as our manager.

When it became clear that we had reached the limits of our existing buggy code, I pushed hard to get to work on the replacement. After what seemed an eternity pitching Solomon, I was finally allowed to work on it.

I wrote the design doc of Docker with the help of Andrea, Jérôme, Louis and Gabriel. I implemented the prototype in Python. To this day, three of us will still argue who really choose the name Docker. We are very good friends.

Not long after, I left the company. Because I was under paid, I could barely make ends meet at the time. I had to borrow money to see a doctor. I did not mind, it's the start-up life, am I right? I worked 80h/week happy. But then I realized not everybody was under paid. And other companies would pay me more for less work. When I asked, Solomon refused to pay me more, and after being denied three times, I quit. I never got any shares. I couldn't afford to buy the options anyways, and they had delayed the paperwork multiple times, such that I technically quit before the vesting started. I went to Google, were they showered me with cash in comparison. The next morning after my departure from dotCloud, Solomon raised everybody's salary. My friends took me to dinner to celebrate my sacrifice.

I am not privy to all the details after I left. But here is what I know. Andrea rewrote Docker in Go. It was made open source. Solomon even asked me to participate as an external contributor. For free of course. As a gesture to let me add my name to the commit history. Probably the biggest insult I ever received in life.

dotCloud was renamed Docker. The original dotCloud business was sold to a German company for the customers.

I believe Solomon saw the potential of Docker for all, and not merely an internal detail within a distributed system designed to orchestrate containers. My vision was extremely limited, and focused on reducing the suffering of my oncall duties.

A side story: the German company transfered to me the trademark of zerorpc, the open source network library powering dotCloud. I had done a lot of work on it. Solomon refused to hand me off the empty github/zerorpc group he was squatting. He offered to grant me access but retain control. I went for github/0rpc instead. I did not have time nor money to spend on a lawyer.

By this point you might think I have a vendetta against a specific individual. I can assure you that I tried hard to paint things fairly with a flattering light.

He means he came up with the same concept, not literally created Docker.
> any security advantage they have is simply due lack of bugs, not simply a better design.

feels like maybe there is some corelation

Would you say approaches like gvisor or nabla containers provide more/enough evolution on the security front? Or is there something new on the horizon that excites you more as a prospect?
GVisor basically works by intercepting all Linux syscalls, and emulating a good chunk of the Linux kernel in userspace code. In theory this allows lowering the overhead per VM, and more fine-grained introspection and rate limiting / balancing across VMs, because not every VM needs to run it's own kernel that only interacts with the environment through hardware interfaces. Interaction happens through the Linux syscall ABI instead.

From an isolation perspective it's not more secure than a VM, but less, because GVisor needs to implement it's own security sandbox to isolate memory, networking, syscalls, etc, and still has to rely on the kernel for various things.

It's probably more secure than containers though, because the kernel abstraction layer is separate from the actual host kernel and runs in userspace - if you trust the implementation... using a memory-safe language helps there. (Go)

The increased introspectioncapabiltiy would make it easier to detect abuse and to limit available resources on a more fine-grained level though.

Note also that GVisor has quite a lot of overhead for syscalls, because they need to be piped through various abstraction layers.

I actually wonder how much "overhead" a VM actually has. i.e. a linux kernel that doesn't do anything (say perhaps just boots to an init that mounts proc and every n seconds read in/prints out /proc/meminfo) how much memory would the kernel actually be using?

So if processes in gvisor map to processes on the underlying kernel, I'd agree it gives one a better ability to introspect (at least in an easy manner).

It gives me an idea that I'd think would be interesting (I think this has been done, but it escapes me where), to have a tool that is external to the VM (runs on the hypervisor host) that essentially has "read only" access to the kernel running in the VM to provide visibility into what's running on the machine without an agent running within the VM itself. i.e. something that knows where the processes list is, and can walk it to enumerate what's running on the system.

I can imagine the difficulties in implementing such a thing (especially on a multi cpu VM), where even if you could snapshot the kernel memory state efficiently, it be difficult to do it in a manner that provided a "safe/consistent" view. It might be interesting if the kernel itself could make a hypercall into the hypervisor at points of consistency (say when finished making an update and about to unlock the resource) to tell the tool when the data can be collected.

https://github.com/Wenzel/pyvmidbg

  LibVMI-based debug server, implemented in Python. Building a guest aware, stealth and agentless full-system debugger.. GDB stub allows you to debug a remote process running in a VM with your favorite GDB frontend. By leveraging virtual machine introspection, the stub remains stealth and requires no modification of the guest.
more: https://github.com/topics/virtual-machine-introspection
thanks, the kvm-vmi is basically an expansive version of what I was imagining (maybe read about it before, as noted, I thought it existed).
Would you recommend ZFS as a building block for modern "VLFS"?

https://blog.chlc.cc/p/docker-and-zfs-a-tough-pair/

> I actually wonder how much "overhead" a VM actually has. i.e. a linux kernel that doesn't do anything (say perhaps just boots to an init that mounts proc and every n seconds read in/prints out /proc/meminfo) how much memory would the kernel actually be using?

You don't necessarily need to run a full operating system in your VM. See eg https://mirage.io/

> I actually wonder how much "overhead" a VM actually has. i.e. a linux kernel that doesn't do anything (say perhaps just boots to an init that mounts proc and every n seconds read in/prints out /proc/meminfo) how much memory would the kernel actually be using?

There's already some memory sharing available using DAX in Kata Containers at least: https://github.com/kata-containers/kata-containers/blob/main...

> to have a tool that is external to the VM (runs on the hypervisor host) that essentially has "read only" access to the kernel running on the VM to provide visibility into what's running on the machine without an agent running within the VM itself

Not quite what you are after, but comes close ... you could run gdb on the kernel in this fashion and inspect, pause, step through kernel code: https://stackoverflow.com/questions/11408041/how-to-debug-th....

What I really want is a "magic" shell on a VM - i.e. the ability using introspection calls to launch a process on the VM which gives me stdin/stdout, and is running bash or something - but is just magically there via an out-of-band mechanism.
Not really "out of band", but many VMs allow you to setup a serial console, which is sort of that, albeit with a login, but in reality, could create one without one, still have to go through hypervisor auth to access it in all cases, so perhaps good enough for your case?
Indeed, easy enough to get a serial device on Xen.

Another possibility could be to implement a simple protocol which uses the xenstore key/value interface to pass messages between host and guest?

You can launch KVM/qemu with screen + text console, and just log in there. You can also configure KVM to have a VNC session on launch, and that ... while graphical, is another eye into the console + login.

(Just mentioning two ways without serial console to handle this, although serial console would be fine.)

Go is not memory safe even when the code has none of unsafe blocks, although through a typical usage and sufficient testing memory safety bugs are avoided. If one needs truly memory-safe language, then use Rust, Java, C# etc.
This sounds vaguely like the forgotten Linux a386 (not i386!).
been out of the space for a bit (though interviewing again, so might get back into it), gvisor at least as the "userspace" hypervisor, seemed to provide minimal value vs modern hypervisor systems with low overhead / quick boot VMs (ala firecracker). With that said, I only looked at it years ago, so I could very well be out of date on it.

Wasn't aware of Nabla, but they seem to be going with the unikernel approach (based on a cursory look at them). Unikernels have been "popular" (i.e. multiple attempts) in the space (mostly to basically run a single process app without any context switches), but it creates a process that is fundamentally different than what you develop and is therefore harder to debug.

while the unikernels might be useful in the high frequency trading space (where any time savings are highly valued), I'm personally more skeptical of them in regular world usage (and to an extent, I think history has born this out, as it doesn't feel like any of the attempts at it, has gotten real traction)

I think many people (including unikernel proponents themselves) vastly underestimate the amount of work that goes into writing an operating system that can run lots of existing prod workloads.

There is a reason why Linux is over 30 years old and basically owns the server market.

As you note, since it's not really a large existing market you basically have to bootstrap it which makes it that much harder.

We (nanovms.com) are lucky enough to have enough customers that have helped push things forward.

For the record I don't know of any of our customers or users that are using them for HFT purposes - something like 99% of our crowd is on public cloud with plain old webapp servers.

Modern gVisor uses KVM, not ptrace, for this reason.
so I did a check, it would seem that gvisor with kvm, mostly works for bare metal, not on existing VMs (nested virtualization).

https://gvisor.dev/docs/architecture_guide/platforms/

"Note that while running within a nested VM is feasible with the KVM platform, the systrap platform will often provide better performance in such a setup, due to the overhead of nested virtualization."

I'd argue then for most people (unless have your own baremetal hyperscaler farm), one would end up using gvisor without kvm, but speaking from a place of ignorance here, so feel free to correct me.

My infra is this exactly. K8s managed containers that manage qemu VMs. Every VM has its own management environment, they don’t ever see each other, and they work just the same as using virtual-manager, but I get infinite flexibility in my env provisioning before I start a VM that gets placed in its tenant network that is isolated.
> I argued that it wasn't just good for containers, but could be used to improve VM management as well (i.e. a single VM per running image

Believe Google embarked on this path with Crostini for ChromiumOS [0], but now it seems like they're going to scale down their ambitions in favour of Android [1]. Crostini may not but looks like the underlying VMM (crosvm) might live on [2].

> I'd argue that FreeBSD Jails / Solaris Zones (Solaris Zone/ZFS inspired my original work) really aren't any more secure than containers on linux, as they all suffer from the same fundamental problem of the entire kernel being part of one's "tcb", so any security advantage they have is simply due lack of bugs, not simply a better design.

Jails (or an equivalent concept/implementation) come in handy where the Kernel/OS may want to sandbox higher privilege services (like with minijail in ChromiumOS [3]).

[0] https://www.youtube.com/watch?v=WwrXqDERFm8&t=300 / summary: https://g.co/gemini/share/41a794b8e6ae (mirror: https://archive.is/5njY1)

[1] https://news.ycombinator.com/item?id=40661703

[2] https://source.android.com/docs/core/virtualization/virtuali...

[3] https://www.chromium.org/chromium-os/developer-library/guide...

> I'd argue that FreeBSD Jails / Solaris Zones [...] really aren't any more secure than containers on linux, as they all suffer from the same fundamental problem of the entire kernel being part of one's "tcb", [...]

And also CPU branch prediction state, RAM chips, etc. The side-channels are legion.