Hacker News
by junon 325 days ago
> iptables was already doing heavy lifting for other subsystems inside our environment, and with each VM adding or removing its own set of rules, things got messy fast, and extremely flakey

We saw the same thing at Vercel. Back when we were still doing docker-as-a-service we used k8s for both internal services and user deployments. The latter led to master deadlocks and all sorts of SRE nightmares (literally).

So I was tasked with writing a service scheduler from scratch that replaced k8s. By the time we got to the manhandling of IP address allocations, deep into the rabbit hole, we had already written our own Redis-backed DHCP implementation and needed to insert those IPs into the firewall tables ourselves, since Docker couldn't really do much at all concurrently.

Iptables was VERY fragile. Aside from the fact it didn't even have a stable programmatic interface, it was also a race condition nightmare: rules were strictly ordered, there was no system for composition or non-destructive changes (namespacing, layering, etc.), and it was just all around the worst tool for the job.

Unfortunately not much else existed at the time, and given that we didn't have time to spend on implementing our own kernel modules for this system, and that Docker itself had a slew of ridiculous behavior, we ended up scrapping the project.

Learned a lot though! We were almost done, until we weren't :)

3 comments

I think iptables compiles BPF filters; you could write your own thing to compile BPF filters. In general, the whole Linux userspace interface (with few exceptions) is considered stable; if you go below any given userspace tool, you're likely to find a more stable, but less well documented, kernel interface. Since it's all OSS, you can even use iptables itself as a starting point to build your own thing.
Nowadays you would use nftables, which like most new-ish kernel infra uses netlink as an API, and supports at least atomic updates of multiple rules. That's not to say there's documentation for that; there isn't.
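
For anyone wondering what the atomic update looks like in practice, here is a minimal sketch in Python. It only renders an nft script as text; it does not talk to netlink or run anything. The one-table-per-VM naming (`vm_<id>`), the function name `render_vm_table`, and the choice of simple `ip saddr ... drop` rules are all assumptions made up for illustration, not anything from the projects discussed above. The add/delete/re-declare sequence is the usual idiom for replacing a table's contents without a window where it is missing.

```python
from __future__ import annotations

import ipaddress
import re


def render_vm_table(vm_id: str, blocked_sources: list[str]) -> str:
    """Render an nft script that replaces one VM's table in a single load.

    Hypothetical layout: each VM owns a table named vm_<id>, so adding or
    removing a VM never touches another VM's rules.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", vm_id):
        raise ValueError(f"unsafe vm_id: {vm_id!r}")
    # Validate addresses up front so a typo never reaches the ruleset file.
    addrs = [str(ipaddress.IPv4Address(ip)) for ip in blocked_sources]

    table = f"vm_{vm_id}"
    lines = [
        f"add table inet {table}",     # make sure it exists so the delete cannot fail
        f"delete table inet {table}",  # drop whatever rules it held before
        f"table inet {table} {{",      # re-declare it with the new contents
        "    chain forward {",
        "        type filter hook forward priority 0;",
    ]
    lines += [f"        ip saddr {a} drop" for a in addrs]
    lines += ["    }", "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_vm_table("a1", ["10.0.0.5", "10.0.0.6"]), end="")
```

You would write the output to a file and load it with `nft -f <file>`, which applies the whole file as one transaction, so other tables never see a half-updated state.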
I spent a decade and a bit away from Linux programming and have recently come back to it, and I'm absolutely blown away at how poor the documentation has become.

Back in the day, one of the best things about Linux was actually how good the docs were. Comprehensive man pages, stable POSIX standards, projects and APIs that have been used since 1970 so every little quirk has been documented inside out.

Now it seems like the entire OS has been rewritten by freedesktop, and if I'm lucky I might find some two-year-out-of-date information on the ArchLinux wiki. If I'm even luckier, that behaviour won't have been completely broken by a commit from @poettering in a minor point release.

I actually think a lot of the new stuff is really fantastic once I reverse engineer it enough to understand what it's doing. I will defend to the death that systemd is, in principle, a lot better than the ad hoc mountain of distro-specific shell scripts it replaces. Pulseaudio does a lot of important things that weren't possible before, etc. But honestly it feels like nobody wants to write any docs because it's changing too frequently. Then everything just constantly breaks, because it turns out changing complex systems rapidly without any documentation leads to weird bugs that nobody understands.

yeah our findings were similar. the issues we saw with iptables rules, especially at scale with ephemeral workloads, were starting to cause us a lot of operational toil. nftables ftw
I've had this problem.

We ended up using Docker Swarm. Painless afterward