Hacker News
by swap32 3500 days ago
It is fine for dev, but not for production. To quote from the link at the end of the article: "Docker is meant to be stateless. Containers have no permanent disk storage, whatever happens is ephemeral and is gone when the container stops. Containers are not meant to store data. Actually, they are meant by design to NOT store data. Any attempt to go against this philosophy is bound to disaster.

Moreover. Docker is locking away processes and files through its abstraction, they are unreachable as if they didn’t exist. It prevents from doing any sort of recovery if something goes wrong"

"A crash would destroy the database and affect all systems connecting to it. It is an erratic bug, triggered more frequently under intensive usage. A database is the ultimate IO intensive load, that’s a guaranteed kernel panic. Plus, there is another bug that can corrupt the docker mount (destroying all data) and possibly the system filesystem as well (if they’re on the same disk)."


All these statements are patently false.

1. Containers are not ephemeral. They have a lifecycle. Data written in the container is persisted to disk and available after the container is stopped and then started again.

2. Processes/files/etc are not locked away as if they don't exist. See `ps aux` on the host. You will see all the processes running. You can inspect the filesystems for each container, etc. There is no magic here.

3. A database crash could cause data corruption inside a container or not. This has nothing to do with the container, and chances of a database crash are not made worse by being in a container.
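Point 1 can be checked by hand. The sketch below only builds the docker CLI invocations (it does not run them) for writing a file in a container with no volumes, stopping it, starting the same container again, and reading the file back. The container name `scratch` and the `alpine` image are illustrative assumptions, not something from the article.

```python
import shlex

def lifecycle_commands(name="scratch", image="alpine"):
    """Return argv lists that write a file, stop, restart, and read it back."""
    return [
        # Start a long-running container with no volumes mounted.
        ["docker", "run", "-d", "--name", name, image, "sleep", "3600"],
        # Write into the container's own writable layer.
        ["docker", "exec", name, "sh", "-c", "echo hello > /note.txt"],
        # Stop, then start the *same* container (not a new one from the image).
        ["docker", "stop", name],
        ["docker", "start", name],
        # Read the file back from the restarted container.
        ["docker", "exec", name, "cat", "/note.txt"],
    ]

if __name__ == "__main__":
    # Print the commands so they can be pasted into a shell one by one.
    for argv in lifecycle_commands():
        print(shlex.join(argv))
```

The file is expected to survive the stop/start cycle; it is removed together with the container on `docker rm`, or automatically if the container was started with `--rm`.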

That said, I would let a volume driver manage persistent storage rather than manually managing this through the host fs... but that's my preference.

--- EDIT --- Disclaimer: I work at Docker Inc, and am a maintainer on the Docker project.

Author of the quoted paragraph here.

0. The lifecycle of docker containers is an extremely complex topic with limited documentation. It's safe to assume that it's out of reach for 9X% of readers here. One needs to fully understand the lifecycle of their containers to attempt to run databases in Docker, and that's a huge barrier to entry. Advising 100% of people to run production (i.e. permanent, long-lived) databases in Docker is terrible advice.

1. The entire concept of containers is based on being ephemeral. They do have storage (in /var/lib/docker/<cryptic-structure>) and they should be started with --rm to make sure that everything they did is cleaned up automatically after they exit. If you want to keep the data and build something around that, good luck with that!

2. Wrong. There is a truckload of magic going on here from filesystems to networking. Docker is hell to debug. A fucked database hidden away in Docker will be close to impossible to debug. If you're a sysadmin, you do not want to be in that position, trust me.

3. The odds of database issues are at least 3 orders of magnitude higher if running within Docker. The docker ecosystem is notoriously unstable and the filesystems are unreliable. (Plus databases are IO intensive, which is gonna trigger all the rare bugs and race conditions.)

Seriously. If you got a brain cell at Docker Corp. PLEASE STOP overselling your product and advising it for absolutely everything without consideration for what people are doing.

Every time one of you guys advises running databases in Docker, you're objecting to everything that docker stands for (i.e. statelessness). Not only is it confusing the hell out of people, but it's putting them on a guaranteed path to future catastrophic failures.

Running production databases inside docker. Just because it's not strictly impossible, doesn't mean it's possible.

    [See RFC1925 https://tools.ietf.org/html/rfc1925 ]
   (3)  With sufficient thrust, pigs fly just fine. However, this is
        not necessarily a good idea. It is hard to be sure where they
        are going to land, and it could be dangerous sitting under them
        as they fly overhead.

0. There is a plethora of documentation. Even the CLI suggests the lifecycle (start, stop, restart, pause, unpause).

1. This is simply not true. Your understanding is that they are based on being ephemeral, but this is not inherent in any sort of design of containers.

2. Magic is not really magic when you understand what's happening. Cgroups apply resource limits on a process, namespaces limit what a process can see. These come together to make containers. The host still has full visibility on these processes just like any other process on the system.

3. Do you have data to back this up? A container is just a process that is namespaced and resource limited. If you are writing to the copy-on-write filesystem provided for the container with a database, then you are doing it wrong (in 99% of cases). For that matter, you can even use ZFS for the container FS, which has been in use in production scenarios for quite some time... performance may not be great with ZFS here but integrity will be (not that I'm advocating for writing directly to the container FS... not at all, really).

There is nothing about Docker and statelessness. It can sure make cleaning up after a process a bit simpler but this doesn't mean that docker equates to statelessness.

Storage is hard whether you are in a container or not. Process isolation does not affect this.
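To make point 2 concrete: on a Linux host, each process exposes its namespaces as symlinks under `/proc/<pid>/ns/`, so the host can list which of the processes it sees live in a different PID namespace (as container processes normally do). This is a minimal sketch, not a Docker API; the function names are made up for illustration, and reading other processes' namespace links usually needs root.

```python
import os

def pid_namespace(pid, proc_root="/proc"):
    """Return the PID-namespace identifier of a process, e.g. 'pid:[4026531836]'."""
    return os.readlink(os.path.join(proc_root, str(pid), "ns", "pid"))

def processes_outside_namespace(reference_pid, proc_root="/proc"):
    """List PIDs visible under proc_root whose PID namespace differs from reference_pid's."""
    reference = pid_namespace(reference_pid, proc_root)
    result = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            ns = pid_namespace(entry, proc_root)
        except OSError:
            # Process exited in the meantime, or we lack permission.
            continue
        if ns != reference:
            result.append(int(entry))
    return sorted(result)

if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    # On a host running containers, this prints host-side PIDs of container processes.
    print(processes_outside_namespace(os.getpid()))
```

Every PID this returns is an ordinary host process that `ps aux` also shows; the namespace only changes what that process itself can see.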

0. That doesn't explain anything about what's happening underneath. It's far from enough to even form a mental model about Docker operations.

1. The statelessness & the ephemeralness & the tooling. It all goes together. Just because it's not enforced all the time at every level doesn't mean that it's a good idea to diverge from it.

2. What about the networking? the DNS magic? the storage? the filesystems? the lifecycle of data across containers & images and containers & further containers? the log management? the logging drivers? It would take multiple books to cover these topics.

3. Again, the filesystem and storage issues could fill an entire book. There are many blog posts and issues talking about that. ZFS only became available very recently and exclusively on Ubuntu; it's ridiculous to consider that a real-world scenario.

Docker equals statelessness. That's the only thing it's supposed to do and could do well. Maybe you should consider focusing on one use case that Docker does well (i.e. packaging & deploying stateless applications). That would make for better documentation and explanations and goals ;)

(IMO. After reading your comments, it seems that you have no clue whatsoever about systems internals [or maybe we just don't communicate well on that]. That's scary if Docker itself doesn't have a clue about what it is nor what it should be.)

> For that matter, you can even use ZFS for the container FS, which has been in use in production scenarios for quite some time... performance may not be great with ZFS here but integrity will be

It's not a very good fit for a production database if "performance may not be great"?

> (not that I'm advocating for writing directly to the container FS... not at all, really).

> There is nothing about Docker and statelessness.

You just recommended against storing state in the container FS on the previous line. What kind of state are you advocating a container should keep (that is different from what is captured in the Dockerfile and any separate data volumes)?

> Storage is hard whether you are in a container or not. Process isolation does not affect this.

But abstraction does. Normally for a database, you'd have a mirrored set of SSDs, lots of RAM, spread over a couple of physical nodes. Maybe with a load balancer thrown in.

Or maybe you'd run your nodes as VMs, with iSCSI or some other NAS/DAS. I can't recall seeing reasonable advice on how to set up such a production system with docker (but I haven't looked all that hard!).

Last time I checked, I couldn't find any suggestions for high-performance, well-tested container storage.

Depends on whether high performance is what you need, but this was just an example of how even the container FS can have incredible integrity.

Why would a container keep you from using mirrored sets of SSDs, lots of RAM, or an LB?

In the absolute worst case, you can set these up manually on your host and map the directories into the container.

In a better scenario, the various storage systems out there (EMC, NetApp, Ceph, you name it) have volume plugins that integrate with Docker, Kubernetes, etc.

How to handle storage in the container depends on your needs, just as if it were a VM or a physical machine... and in the worst case the setup is ultimately no different.

>> Data written in the container is persisted to disk

Even if you don't mount any folder from the disk onto the container? Are you sure? Then everything I know about containers is just wrong.

I think there's a bit of a mix-up between what an image is and what a container is. You normally don't write data to an image, but you can write data to the container. You can keep this data so long as you keep the container, and you can commit that container's data to a new image if you wish.

A good place I've found to visualize how the filesystems work is this blog: http://merrigrove.blogspot.com/2015/10/visualizing-docker-co...

Yes. The filesystem in the container is a real filesystem backed by the disk.

How this happens depends on the storage driver used. The `aufs` driver (default when available), as well as the `overlay(2)` and `vfs` drivers, just sit on top of the existing filesystem at `/var/lib/docker` (or the defined docker root). BTRFS, ZFS, and devicemapper must be pre-configured before they can be used at all, and the details depend on how you configure them, but they would still generally be backed by an actual disk.
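For anyone who wants to see which case applies on their own host, `docker info --format '{{.Driver}}'` prints the active storage driver. The sketch below wraps that call and sorts the result into the two groups described above; the grouping is copied from this comment and is not an exhaustive or current list, and the helper names are made up.

```python
import shutil
import subprocess

# Grouping taken from the comment above; defaults and available drivers
# have changed across Docker releases.
LAYERED_ON_DOCKER_ROOT = {"aufs", "overlay", "overlay2", "vfs"}
NEEDS_PRECONFIGURED_BACKEND = {"btrfs", "zfs", "devicemapper"}

def describe_driver(driver):
    """Map a storage driver name to the category described in the thread."""
    name = driver.strip().lower()
    if name in LAYERED_ON_DOCKER_ROOT:
        return "layered on the existing filesystem under the Docker root"
    if name in NEEDS_PRECONFIGURED_BACKEND:
        return "needs a pre-configured backend, normally still on a real disk"
    return "unknown driver"

def current_driver():
    """Ask the local daemon for its storage driver; None if docker is unavailable."""
    if shutil.which("docker") is None:
        return None
    try:
        proc = subprocess.run(
            ["docker", "info", "--format", "{{.Driver}}"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    name = proc.stdout.strip()
    return name if proc.returncode == 0 and name else None

if __name__ == "__main__":
    driver = current_driver()
    print(driver, "->", describe_driver(driver) if driver else "docker not reachable")
```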

I keep hearing that you shouldn't containerize databases. What is the motivation behind this?

I still don't understand this theory. As stated above, containers have the option to mount volumes on the host file system. Anything written while the container is running is immediately persisted, and if the container dies you just re-mount the volume and continue as normal.

To harden this even further, you can run clustered DB nodes in Docker (+<your_preferred_orchestration_tool>) quite easily. So with persisted data, multiple node replication, and server snapshots, I'd be interested to know the motivation as well.
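A rough sketch of the host-mount approach described above, assuming a Linux host. It only builds the docker CLI arguments; the host path `/srv/pgdata`, the `postgres` image, its data directory `/var/lib/postgresql/data`, and the container name are all illustrative assumptions that should be checked against whatever image is actually used.

```python
def run_with_host_dir(host_dir, container_dir, image, name):
    """Build a `docker run` argv that bind-mounts host_dir at container_dir."""
    if not host_dir.startswith("/"):
        # A non-absolute source is read by docker as a named volume, not a host path.
        raise ValueError("host_dir must be an absolute path")
    return ["docker", "run", "-d", "--name", name,
            "-v", f"{host_dir}:{container_dir}", image]

def replace_container(host_dir, container_dir, image, name):
    """Commands to discard a dead container and start a fresh one on the same data."""
    return [
        # Removing the container does not touch the bind-mounted host directory.
        ["docker", "rm", "-f", name],
        run_with_host_dir(host_dir, container_dir, image, name),
    ]

if __name__ == "__main__":
    for argv in replace_container("/srv/pgdata", "/var/lib/postgresql/data",
                                  "postgres", "db"):
        print(" ".join(argv))
```

Because the data lives in the host directory, the replacement container picks up where the old one stopped, subject to the database's own crash recovery.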