by 0xbadcafebee 869 days ago
You aren't a real K8s admin until your self-managed cluster crashes hard and you have to spend 3 days trying to recover/rebuild it. Just dealing with the certs once they start expiring is a nightmare.
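One way to take some of the surprise out of expiring certs is to check their end dates on a schedule. A minimal sketch, assuming you already have each cert's `notAfter` string (for example from `openssl x509 -noout -enddate`, with the `notAfter=` prefix stripped); the function names and the 30-day threshold are illustrative, not taken from any K8s tooling:

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days left before a certificate's notAfter timestamp.

    `not_after` is assumed to look like "Jun  1 12:00:00 2030 GMT",
    the format used in certificate notBefore/notAfter fields.
    """
    # ssl.cert_time_to_seconds parses that format into seconds since the epoch.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_rotation(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the cert expires in fewer than `warn_days` days (hypothetical policy)."""
    return days_until_expiry(not_after, now) < warn_days


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(needs_rotation("Jun  1 12:00:00 2030 GMT", now))
```

Run something like this from a box outside the cluster, so the check still works when the cluster itself is down.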

To avoid chicken-and-egg, your critical services (Drone, Vault, Bind) need to live outside of K8s in something stupid simple, like an ASG or a hot/cold EC2 pair.

I've mostly come to think of K8s as a development tool. It makes it quick and easy for devs to mock up a software architecture and run it anywhere, compared to trying to adopt a single cloud vendor's SaaS tools, and giving devs all the Cloud access needed to control it. Give them access to a semi-locked-down K8s cluster instead and they can build pretty much whatever they need without asking anyone for anything.

For production, it's kind of crap, but usable. It doesn't have any of the operational intelligence you'd want a resilient production system to have, doesn't have real version control, isn't immutable, and makes it very hard to identify and fix problems. A production alternative to K8s should be much more stripped-down, like Fargate, with more useful operational features, and other aspects handled by external projects.


It's kind of the modus operandi of Kubernetes since inception. The core model is okay, but ops was always a barely constructed afterthought. And the network stack (kube-proxy) was literally a summer of code project.

I'm thinking a lot of that was by design - both Red Hat and Google had incentives to get you onto their value-add to get an actual production-ready system.

It also created an entire cottage industry, although much of this has faded as everyone moved to purely managed solutions. Because anything else is absolutely insane.

I’m not sure if it’s intentional. I don’t find the other container orchestrators that much better either.

No one ever cares about making tooling in any software project. You’re always using something by a dead-ass random third-party.

Microsoft is probably the company where I am actually using Microsoft-made tools to manage Microsoft-made products. And maybe Adobe back in the day.

In the bad old days of self-managing some servers with a few libvirt VMs and such, I’d have considered a 3-day outage such a shockingly bad outcome that I’d have totally reconsidered what I was doing.

And k8s is supposed to make that situation better, but these multi-day outage stories are… common? Why are we adding all this complexity and cost if the result is consumer-PC-tower-in-a-closet-with-no-IAC uptime (or worse)?

I've been running Kubernetes in production for two years and have never experienced anything remotely close to this. The worst is a node dies every now and then and, on a rare occasion, a workload doesn't happily migrate.

Of course, my experience is in no way authoritative, but referencing this type of incident as common is pretty foreign to me and may be mostly relegated to self-managed clusters.

GKE since 2017 here. Healthcare. I think we had one major outage that involved the cluster itself. It resolved itself and we never discovered what caused it. That was in the early days, so I recall very little.

Now I'm using Fly.io. They both have their advantages. Folks tend to make kubernetes sound way more difficult than it is. It can be overkill but it can also solve so many challenges out of the box. At least when it's managed. It'll cost you though.

> may be mostly relegated to self-managed clusters.

Foreign to me too, but it's not surprising that people report issues as common. There are a lot of footguns in kubernetes that come from a lack of understanding.

You can build a robust kubernetes cluster that hosts an application that’s nearly impossible to bring offline without an act of god, it just takes some know-how and a tiny bit of effort/experience.

> And k8s is supposed to make that situation better, but these multi-day outage stories are… common? Why are we adding all this complexity and cost if the result is consumer-PC-tower-in-a-closet-with-no-IAC uptime (or worse)?

I'm honestly convinced it's half CV-driven development, and half just the fact that it's become the standard workaround for Python dependency hell. Python is still the easiest way to write software, and it's still basically impossible to make an application that works reliably on more than one machine because of how Python dependency management works (or rather doesn't), so you have to use Docker, and apparently Kubernetes is the standard way you deploy Docker containers.

> apparently Kubernetes is the standard way you deploy Docker containers.

I bet more people actually use docker compose because the buy-in is that much smaller.

Anecdata, but in my experience, it's been podman for new deployments. Plenty of old stuff on Docker though. It's easier to grow out of Podman and into k8s than it is to go from compose, to swarm, then k8s. Easier to get buy-in for the ease of Docker from ops, easier to get leadership buy-in on the security of Podman. Such is life.

> Python is still the easiest way to write software

Try dotnet then

There are good things about dotnet (I'm more of a Scala person these days, but I have plenty of respect for F#), but there's nothing in there that lets you get up and running remotely as quickly as Python. (I mean, you don't even get a REPL without doing some messing around)

> and it's still basically impossible to make an application that works reliably on more than one machine because of how Python dependency management works (or rather doesn't)

This is complete bullshit.

I remember we were running 500 solaris zones and 5000 vmware VMs over 2 datacenters with 0 major outages over 4 years. I remember a (single) VM crashing and it was a really big deal (turned out it was a config issue, in retrospect a funny one although our (internal) client lost some data). And I remember we were in "crisis mode" for a couple weeks because of SAN storage issues but there was no client interruption of more than 1 minute over those 2 weeks. One of our clients was running our app in a cross-datacenter cluster on bare metal with no interruption for over 20 years.

I'm not advocating for any of those specific solutions and given the choice I would probably use something else, but when I see that my previous CTO wanted kube for single-VM deployments, and a former architect colleague wanted kube for apps that were going to be used by 3 to 5 clients maximum (and in both cases to be run by very small and untrained teams), I think the kool aid has been more than drunk, and I'm now avoiding it like the plague.

Complexity and cost aren't bad when they help produce something of value that we wouldn't have otherwise.

For $150 I can fly round trip from New York to San Francisco, on a massively costly and complex giant noisy metal tube with two blades sticking out the sides that are so strong you could put a tank on each one and the blades still wouldn't droop. Why does it have to be so costly and complex, if I could do something simpler, like take a bus? Well, mostly to keep me from dying. But also to carry lots of luggage, keep costs down, and get me there 15x faster.

K8s does provide great value (as a dev tool), but lacks value in production features, and its design is shit. So I wouldn't say complexity and cost are the downside; it's the lacking production value that's the downside.

Personally, I’m a big fan of it for QA review sites. Deploy multiple low traffic full site clones to a cluster and spin them up and down as needed. Manual review, automated scans, etc. It’s great for that use case IMO.

In production I always want dedicated resources though.

Honestly in this day and age rolling your own k8s cluster is negligent. I've worked at multiple companies using EKS, AKS, GKE, and we haven't had 10% of the issues I see people complaining about.

I've had my fair share of outages on managed k8s solutions. The difference there is once it's hosed, your fate is 100% in the hands of cloud support and well... good luck with that one. The cloud apologists in this thread will ofc try to shame you for not buying into their marketing.
if your fate is in the hands of one of the cloud gods, what right does anyone have to blame you for what transpires?

mere mortals are not privy to all of the internal downstream impacts from that public-facing service outage. it would be like shouting into the void and expecting an answer, and, more, liking it.

no, it is easier to recognize one’s place, pay the tithes, and enjoy one god’s blessings and curses alike. do not stray and attempt to please two, it will only end in misery. (three is right out.)

Once your team has upgrades down, everything is pretty rote. This submission (Urbit, lol) seemed particularly incompetent at managing cert rotation.

The other capital lesson here? Have backups. The team couldn't restore a bunch of their services effectively, 'cause they didn't have the manifests. Sure, a managed provider may have fewer disruptions/avoid some fuckups, but the whole point of Kubernetes is Promise Theory, is Desired State Management. If you can re-state your asks, put the manifests back, most shit should just work again, easy as that. The team had seemingly no operational system so their whole cluster was a vast special pet. They fucked up. Don't do that.
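If the manifests only exist inside the cluster, one stopgap is to export them and strip the fields the API server fills in, so what's left can be committed and re-applied later. A rough sketch, assuming the input is one object as returned by `kubectl get <kind> <name> -o json`; the helper name and the exact list of fields to drop are my own choices, not an official tool:

```python
import copy
import json

# Metadata fields the API server sets; they should not be fed back on re-apply.
SERVER_SET_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "managedFields",
    "selfLink",
)


def to_reappliable(obj: dict) -> dict:
    """Return a copy of an exported object with server-populated fields removed."""
    clean = copy.deepcopy(obj)  # leave the caller's object untouched
    clean.pop("status", None)   # status is observed state, not desired state
    metadata = clean.get("metadata", {})
    for key in SERVER_SET_METADATA:
        metadata.pop(key, None)
    return clean


if __name__ == "__main__":
    exported = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "app-config", "namespace": "default", "uid": "abc-123"},
        "data": {"LOG_LEVEL": "info"},
    }
    print(json.dumps(to_reappliable(exported), indent=2))
```

This is a salvage step at best. Keeping the manifests in version control from day one is the real fix, since an export like this can still carry defaulted fields you never wrote.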

this is actually a separate project from urbit, called urb-it https://urb-it.webflow.io/

Different Urbit.

What's drone?

https://www.drone.io/

> Automate Software Build and Testing. Drone is a self-service Continuous Integration platform for busy development teams.

Simplest possible CI tool that exists, as far as I'm aware. Gives you just barely everything you need, everything is stupid simple, and it just works.

There's an OSS fork in development (https://woodpecker-ci.org/) but it's far behind in terms of features and stability.