by ownagefool 2459 days ago
Out of interest, what was wrong with it and how did you fix it?

In 4 years I've never come across a cluster I was unable to fix, nor has it really broken without someone taking an unadvisable action on it. This may simply be because I started early enough that I was forced to manually configure the components and thus understand the underlying system well enough.

Over time I have seen some interesting things though:

- Changing the overlay network on running servers was probably the silliest thing I've done. This wasn't on production, but figuring out where all the files were and deleting them was pretty much undocumented.

- A few years back somebody ran an HA cluster without setting it as HA, which resulted in occasional races where services kept changing IP addresses. I believe the ability to do this was patched out.

- An upgrade caused a doubling of all pods once. This was back when deployments were alpha/beta and they changed how they were referenced in the underlying system, causing deployments to forget their replicasets, etc.

Overall though, in 4 years I've spent very little time debugging clusters and more time debugging apps, which is what we want.


> nor has it really broken without someone taking an unadvisable action on it

You’re basically saying “the tool X is fine, you’re just inexperienced/undisciplined and using it wrong”. Which would be a fair critique if I were an intern, but I have a decade+ of experience in development and operations and I look at kubernetes in disbelief - why should things be that complicated? I get it, everything is pluggable and configurable, but surely this must be balanced out by making it more approachable and convenient?

You can’t sneeze in kubernetes without it requiring you to generate some ssl certs, to the point where it’s just cargo-culting without any consideration of purpose and security.

And what’s up with dozens and dozens of bloated yamls and golang files? The fresh 30-odd commits “official” flink operator is 3 THOUSAND lines of Go and 5 THOUSAND lines of yamls. How is that reasonable? In which universe is that reasonable? All it does is a for-loop that overwrites a bunch of pods to keep their spec in sync with desired config. There’s like 1000:1 boilerplate ratio in kubernetes and it’s considered good somehow?
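
For illustration, the "for-loop" described here is roughly a reconcile loop. Below is a minimal sketch in Python rather than the operator's Go. Everything in it is an assumption made for the example: the `reconcile` function, the pod names, the image tags, and the plain dicts standing in for the cluster API. It is not taken from the flink operator.

```python
def reconcile(desired, actual):
    """Make `actual` match `desired`; return the names that were changed."""
    changed = []
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actual[name] = dict(spec)  # create or overwrite the pod spec
            changed.append(name)
    for name in list(actual):
        if name not in desired:
            del actual[name]  # remove pods that are no longer wanted
            changed.append(name)
    return changed


# Hypothetical desired config and observed state.
desired = {
    "jobmanager": {"image": "flink:1.9", "replicas": 1},
    "taskmanager": {"image": "flink:1.9", "replicas": 3},
}
actual = {
    "taskmanager": {"image": "flink:1.8", "replicas": 3},
    "stale": {"image": "flink:1.7", "replicas": 1},
}

changed = reconcile(desired, actual)
```

Running `reconcile` a second time against the same `desired` changes nothing, which is the property that lets such a loop run repeatedly without harm.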

Sorry for the rant, I’m just angry that we’re six decades into software engineering and the newest hottest project in the newest hottest line of work behaves like everybody should be paid per line of code they produce.

Not sure I'd actually even responded to you, but that's not at all what I was saying.

You can have a decade of tech experience and still not know another system well. We all forget the learning we did to get to where we are, but I'm sure all the old reliable tools were frustrating at one point too.

Personally, I don't find kubernetes that complex, but then I did write and set up a scheduler for an early IaaS provider, so maybe I'm just comfortable with the problem, or maybe it's simply because I've been using it for several years.

Flink is shit software. You're right, those things are ridiculous. They're an indication that something is wrong, and you pushed ahead anyway. Your problems are your own.

I wrote a long reply to someone else’s question below that should answer your question :)

It's interesting, because many of your problems there are relatable to the simpler deploy discussed by the parent. I'd be no wiser debugging your bespoke ansible script, and likely neither would you, if not for the fact you've written it.

Don't get me wrong, debugging overlay networking issues isn't something to love, but it's also not all that complex:

- There's a worker daemon on every box that manages the local configuration, whether that's iptables, IPVS, BPF or something else. There may be a separate worker for service IP addresses than for pod IP addresses.

- There's a controller that does the actual figuring out of what things should be doing and lays out the rules for the workers. This might include a network policy controller, though that might run in a separate daemon.

This setup enables Service IPs, Pod IP addresses & Network Policy.
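
As a rough sketch of that split, the controller works out the rules and each worker applies them locally. The names here (`Pod`, `Service`, `compute_rules`, `Worker`) and all the IP addresses are hypothetical, network policy is left out, and a real worker would program iptables, IPVS or BPF rather than update a dict:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pod:
    name: str
    node: str
    ip: str
    labels: frozenset  # e.g. frozenset({("app", "web")})


@dataclass(frozen=True)
class Service:
    name: str
    cluster_ip: str
    selector: frozenset  # labels a pod must carry to back this service


def compute_rules(services, pods):
    """Controller side: map each service IP to the pod IPs behind it."""
    rules = {}
    for svc in services:
        backends = sorted(p.ip for p in pods if svc.selector <= p.labels)
        rules[svc.cluster_ip] = backends
    return rules


@dataclass
class Worker:
    """Worker side: one per node, holds that node's local forwarding table."""
    node: str
    table: dict = field(default_factory=dict)

    def sync(self, rules):
        # Stand-in for programming the node's real packet-forwarding rules.
        self.table = {ip: list(backends) for ip, backends in rules.items()}


pods = [
    Pod("web-1", "node-a", "10.0.1.5", frozenset({("app", "web")})),
    Pod("web-2", "node-b", "10.0.2.7", frozenset({("app", "web")})),
    Pod("db-1", "node-b", "10.0.2.9", frozenset({("app", "db")})),
]
services = [Service("web", "10.96.0.10", frozenset({("app", "web")}))]

rules = compute_rules(services, pods)
workers = [Worker("node-a"), Worker("node-b")]
for w in workers:
    w.sync(rules)
```

Every worker ends up with the same service-to-pod mapping, so traffic to the service IP can be forwarded from any box regardless of where the backing pods run.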

Obviously in ansible you can just write your own firewall rules, but as soon as you step away from running every app on every box, you'll either be relying on something just as complex (but managed by someone else), like the cloud provider's SDN, or you'll need to run your own system that does the same.

As with anything, it depends what you're doing, but I like auto recovery, app level health checks, infrastructure as code, namespaces, resource quotas, and don't want to force my dev teams to couple their network policies with infrastructure details, so I'm fairly happy with the abstraction.