by topspin 1054 days ago
Accessing local file systems from a container? What heresy is this? Containers must all be stateless webscale single-"process" microservices with no need of local file systems and other obsolescent concepts.

Next thing you know someone will run as many as two whole "processes" in a container!

Having dispensed with that bit of bitter sarcasm: solving their local filesystem performance/security problems is great and all, but what I'd like to see is containers using the already invented wheel of remote block devices, à la iSCSI and friends. I dream of getting there with Cloud Hypervisor or some such, where every container has a kernel that can network-transparently mount whatever it has the credentials to mount from whatever 'worker' node it happens to be running on.


In k8s that already exists via CSI[0], but kubelet handles the setup/teardown signaling and it requires a 3rd-party provisioner daemon, so it sits at a higher level than the container runtime (runsc in this case).

[0] - https://kubernetes-csi.github.io/docs/

Yes. I know. K8s has delivered the moral equivalent of what we've had built-in to our OS kernels[1] since before some of the people reading this were born, and they've only had to add two layers of complexity, fragility and inscrutability on top of k8s itself, one of which is a third party dependency.

This is my excited face. :|

[1] 2005: https://lwn.net/Articles/131747/

No, k8s has not delivered that. It's built an orchestration layer on top of iSCSI, NVMe-oF or whatever "remote disk" tech the kernel has implemented and abstracted that from devs which was the whole point of k8s.

> abstracted that from devs which was the whole point of k8s

That may be the point, but the actual impact is that "devs" became "devops" and now spend some multiple of their actual software development time puzzling over operations abstractions.

This would mean that every container has its own buffer cache, you can no longer have intentional shared state (K8s secrets, shared volumes, etc.), and must construct block overlays instead of cheap file overlays. You’re definitely losing some of the advantages a container brings.

There are other advantages — low fixed resource costs, global memory management and scheduling, no resource stranding, etc. — but the core intent of gVisor is to capture as many valuable semantics as possible (including the file system semantics) while adding a sufficiently hard security boundary.

I’m not saying moving the file system up into the sandbox is bad (which is basically what a block device gives you), just that there are complex trade-offs. The gVisor root file system overlay is essentially that (the block device is a single sparse memfd, with metadata kept in memory) but applied only to the parts of the file system that are modified.

A container, being basically a chroot, consumes a rather small amount of resources, mostly as space in namespace and ipfilter tables.

If your containers use many of the same base layers (e.g. the same Node or Python image), the code pages will be shared, as they would be shared with plain OS processes.

Running several processes in a container is the norm. First, you run with --init anyway, so there is a `tini` parent process inside. Then, Node workers and Java threads are pretty common.

Running several pieces of unrelated software in a container is less common, that's true.

Containers are a way to isolate processes better, and to package dependencies. You could otherwise be doing that with tools like selinux and dpkg, and by setting LD_nnn env variables. Containers just make it much easier.

> Running several processes in a container is the norm.

I'm highly aware. The reason the word "process" is quoted in my highly down-voteable comment is the misuse of the term "process" by Docker et al. to mean "application." Google the "one process per container" mantra to see what I mean. Somehow the Docker crowd were oblivious to the 60+ year old concept of and terminology related to OS processes when they promulgated their guidance on how containers should be used.

I try not to indulge too many hang-ups in life, but that particular bit of damage is insufferable.

I quite like containers to limit/reserve the RAM/CPU use for certain processes. For example, imagine a tiny service used by few concurrent users that needs a SQL db, an app server, and a reverse proxy (for SSL/caching) in front. I'm quite happy to put stuff like this on a tiny VM with 1 vCPU and 1GB RAM. Monthly cost ~$5 for compute. I typically reserve/limit 64MB/128MB for nginx, 384MB/512MB for mariadb, and 256MB/384MB for the app server (PHP etc). Also I have CPU share reservations/limits too. Of course it requires tuning the configs, but it runs great (verified with load testing and actual use). If you put the same software on the same host with no reservations/limits there are situations where latencies grow a lot or the whole thing freezes because one component consumes too many resources. If anyone knows any non-container low overhead ways to partition a single vcpu and a gig of ram like this I'd be interested to hear about it.
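To make the arithmetic behind those numbers explicit, here is a small sketch that just totals the reservation/limit pairs quoted above. The dictionary keys and variable names are made up for illustration; only the MB figures come from the comment.

```python
# Reservation/limit pairs in MB, copied from the comment above.
# The service keys are illustrative labels, not real unit or container names.
BUDGET_MB = {
    "nginx": (64, 128),
    "mariadb": (384, 512),
    "app": (256, 384),
}
HOST_RAM_MB = 1024  # the 1GB VM


def totals(budget):
    """Return (sum of reservations, sum of limits) in MB."""
    reserved = sum(reserve for reserve, _ in budget.values())
    limit = sum(cap for _, cap in budget.values())
    return reserved, limit


reserved, limit = totals(BUDGET_MB)
headroom = HOST_RAM_MB - reserved
print(f"reserved={reserved} MB, limits={limit} MB, unreserved={headroom} MB")
```

The reservations add up to 704 MB and the limits to exactly 1024 MB, so all three services cannot sit at their ceilings at once next to the kernel and the rest of the OS. Presumably that is part of why the configs need tuning.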

Systemd has a mechanism[0] for configuring those limits.

I believe you can limit a unit to 1 vCPU and 256MB of memory by using something like the following:

[Service]
# 100% of one core (comments must be on their own line)
CPUQuota=100%
MemoryLimit=256M

Red Hat has some documentation[1] as well if the systemd stuff is too oblique.

[0]: https://www.freedesktop.org/software/systemd/man/systemd.res...

[1]: https://access.redhat.com/documentation/en-us/red_hat_enterp...
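If you want to generate that kind of snippet for several services, a minimal sketch could look like the following. The function name is hypothetical, and it assumes a cgroup v2 host, where MemoryMax= is the current spelling of the older MemoryLimit= setting.

```python
def render_dropin(cpu_quota_percent, memory_max_mb):
    """Render a [Service] drop-in capping CPU time and memory.

    Assumption: cgroup v2, so MemoryMax= is used instead of MemoryLimit=.
    """
    lines = [
        "[Service]",
        # CPUQuota is a percentage of one CPU; 100% means one full core.
        f"CPUQuota={cpu_quota_percent}%",
        # Sizes take the suffixes K, M, G or T (base 1024).
        f"MemoryMax={memory_max_mb}M",
    ]
    return "\n".join(lines) + "\n"


print(render_dropin(100, 256))
```

The output is meant to go into a drop-in file such as /etc/systemd/system/<unit>.service.d/limits.conf (the file name is arbitrary), followed by a `systemctl daemon-reload` and a restart of the unit.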

> If anyone knows any non-container low overhead ways to partition a single vcpu and a gig of ram like this I'd be interested to hear about it.

You can use cgroups[1] to do this, because that's what your container runtime is doing. People don't know this because they think these features have something to do with their container runtime and that's what they use, so no one discovers it.

Plus, the user facing tools for cgroups are slightly hideous. And that won't ever get fixed for the reasons previously stated. Sigh.

Also, I'm sure a lot of people would appreciate learning about your tuning techniques, containers or otherwise. Consider writing it up.

[1] circa 2007...
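For the curious, a rough sketch of what "using cgroups directly" amounts to, assuming the cgroup v2 file interface (memory.max, cpu.max, cgroup.procs). The helper names and the "nginx-demo" group are made up, and the demo writes into a scratch directory rather than /sys/fs/cgroup, so it shows the file formats without needing root.

```python
import os
import tempfile


def apply_limits(cgroup_dir, memory_max_bytes, cpu_quota_us, cpu_period_us=100_000):
    """Write cgroup v2 style memory and CPU limits into cgroup_dir."""
    os.makedirs(cgroup_dir, exist_ok=True)
    # memory.max: hard memory cap in bytes.
    with open(os.path.join(cgroup_dir, "memory.max"), "w") as f:
        f.write(str(memory_max_bytes))
    # cpu.max: "<quota> <period>" in microseconds; 50000 100000 is half a core.
    with open(os.path.join(cgroup_dir, "cpu.max"), "w") as f:
        f.write(f"{cpu_quota_us} {cpu_period_us}")


def add_process(cgroup_dir, pid):
    """Move a process into the group by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))


# Dry run against a temporary directory (an assumption made for safety).
demo_root = tempfile.mkdtemp()
demo = os.path.join(demo_root, "nginx-demo")
apply_limits(demo, 128 * 1024 * 1024, 50_000)
with open(os.path.join(demo, "cpu.max")) as f:
    print(f.read())
```

On a real host the directory would live under /sys/fs/cgroup, the writes need root, and the memory and cpu controllers have to be enabled in the parent's cgroup.subtree_control first.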

You can set up the same limits via systemd units, as they use the same interfaces as containers. They can be set via systemd overrides to existing services, so they are pretty straightforward.

Just beware of the same problems, like the swap trap (limiting only memory, rather than memory and swap, just makes apps that hit the memory limit start to swap like hell).
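A quick way to spot the swap trap, sketched under the assumption of cgroup v2, where the two relevant files are memory.max and memory.swap.max. The function names are hypothetical, and a missing memory.swap.max is treated as "uncapped", which may be wrong on hosts with no swap at all.

```python
import os
import tempfile


def read_limit(cgroup_dir, name):
    """Return the stripped content of a cgroup file, or None if it is missing."""
    try:
        with open(os.path.join(cgroup_dir, name)) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


def swap_trap(cgroup_dir):
    """True when memory is capped but swap is left unlimited."""
    mem = read_limit(cgroup_dir, "memory.max")
    swap = read_limit(cgroup_dir, "memory.swap.max")
    mem_capped = mem not in (None, "max")
    swap_capped = swap not in (None, "max")
    return mem_capped and not swap_capped


# Demo on a scratch directory that mimics a cgroup with only memory capped.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "memory.max"), "w") as f:
    f.write("134217728\n")
with open(os.path.join(demo, "memory.swap.max"), "w") as f:
    f.write("max\n")
print(swap_trap(demo))
```

With systemd the matching setting is MemorySwapMax=, next to MemoryMax=, and setting it to 0 stops the unit from swapping at all.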