by smarterclayton 1208 days ago
Re out of order:

Is https://kubernetes.io/docs/concepts/workloads/controllers/st... unsuitable for that?


Unfortunately, while rolling updates account for some scenarios, they are not sufficient for handling out of order restarts where the order cannot be pre-determined. There’s probably some hack you could build with partitioning to mostly address the cases I am thinking of, but it isn’t elegant or guaranteed correct.
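To make the limitation concrete, here is a rough sketch of which pods a partitioned rolling update touches and in what order. It assumes the documented StatefulSet behaviour that pods with an ordinal greater than or equal to the partition are updated, from the highest ordinal down. The function name and the "db" set name are made up for illustration. The point is that the order is fixed by ordinal and cannot follow something discovered at runtime.

```python
def pods_updated_by_partition(name: str, replicas: int, partition: int) -> list[str]:
    """Return the pod names a partitioned rolling update would touch, in order.

    Hypothetical helper: it only models the ordering rule, it does not talk
    to a cluster. Pods with ordinal >= partition are updated, highest first.
    """
    if replicas < 0 or partition < 0:
        raise ValueError("replicas and partition must be non-negative")
    # Walk ordinals from the last pod down to the partition boundary.
    ordinals = range(replicas - 1, partition - 1, -1)
    return [f"{name}-{i}" for i in ordinals]


# Five replicas with partition=3: only db-4 and db-3 are updated, in that order.
print(pods_updated_by_partition("db", 5, 3))
```

Lowering the partition one step at a time gives some control, but only ever in descending ordinal order.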

This will be a problem for any database where clustering is synchronous and a specific primary node must start first on a full cluster restart. There are other out of band hacks you can do with reassigning PVCs, but it’s never elegant in the current primitives provided.

During my work in this problem space I became convinced that primitives for stateful applications in K8S were built specifically without considering databases as a valid use case. Everything else is just hacks after the fact to make it “work”.

Since I helped design them, I take some issue with that :). Certainly we never expected they would completely solve problems for the database, but they were definitely intended to provide guarantees that simplify normal consensus operations and prevent accidental confusion with non-perfect databases.

If a specific primary must start first, that’s partially what ordinals were intended to allow (0 is your primary, the others are always 1-N, and kube is responsible for ensuring the primary is never reassigned). I’d love to take feedback about places where the primitives are unusable, or ways they can be improved, because there are always new tools to add.
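As a minimal sketch of the ordinal convention, a pod in a StatefulSet is named after the set plus its ordinal (for example "mysql-0"), and its hostname matches that name, so an entrypoint script can derive its role from it. The function names and the "primary"/"replica" labels below are assumptions for illustration, not part of any Kubernetes API.

```python
import re

# StatefulSet pods are named "<set-name>-<ordinal>"; the set name may itself
# contain dashes, so only the trailing digits are treated as the ordinal.
_POD_NAME_RE = re.compile(r"^(?P<set>.+)-(?P<ordinal>\d+)$")


def ordinal_from_pod_name(pod_name: str) -> int:
    """Extract the trailing ordinal from a StatefulSet pod name."""
    match = _POD_NAME_RE.match(pod_name)
    if match is None:
        raise ValueError(f"not a StatefulSet-style pod name: {pod_name!r}")
    return int(match.group("ordinal"))


def role_for(pod_name: str) -> str:
    """Hypothetical convention: ordinal 0 is the primary, the rest are replicas."""
    return "primary" if ordinal_from_pod_name(pod_name) == 0 else "replica"


print(role_for("mysql-0"))   # primary
print(role_for("my-db-12"))  # replica
```

Inside a pod, the name would typically come from the hostname (socket.gethostname() in Python). This static mapping is exactly what breaks down when the member that must start first is not known in advance.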

I am not working in this problem space any longer, but it’s likely we’ve crossed paths. I have previously presented some suggestions to the Storage SIG on this topic. Feel free to reach out to me with the info in my profile and I can get back to you with a more detailed write-up on the specific challenges that I would like to see addressed in StatefulSet. I am currently traveling, though, so my response will be delayed.

>"I have previously presented some suggestions to the Storage SIG on this topic."

Is there any chance you might be able to provide a link to these? I would be curious to take a look.

You are talking about things like the "All Nodes Go Down Without a Proper Shutdown Procedure" example in https://galeracluster.com/library/documentation/crash-recove... right?

To handle this case, some teams may have a manual runbook, some teams may have some automation with ansible, some teams may have nothing. So, if someone can come up with some hacks and package that into a k8s operator, it is still a win. It seems to be the best primitives we have at the moment.

That is exactly one of the scenarios I am thinking about. Yes, there are hacks with PVCs that make recovering from this possible today with StatefulSet.

In that scenario it looks like members must coordinate to identify the highest committed transaction (identifying the list of valid members) and then bootstrap from that member?
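A rough sketch of that selection step, assuming each member can report a cluster identifier and its last committed sequence number (how those values are collected is left out, and this is not Galera's actual recovery procedure). MemberState, its fields, and choose_bootstrap_member are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberState:
    """Hypothetical snapshot of what one member knows after a crash."""
    name: str            # e.g. the StatefulSet pod name
    cluster_uuid: str    # identifier of the cluster history the member belongs to
    last_committed: int  # highest transaction sequence number it has committed


def choose_bootstrap_member(states: list[MemberState]) -> MemberState:
    """Pick the member with the most advanced committed state.

    Refuses to choose if members disagree about which cluster they belong to,
    since comparing sequence numbers across histories is meaningless.
    """
    if not states:
        raise ValueError("no member states reported")
    if len({s.cluster_uuid for s in states}) != 1:
        raise ValueError("members report different cluster identifiers")
    # Highest sequence number wins; ties go to the smallest name so the
    # choice is deterministic no matter the order reports arrive in.
    return min(states, key=lambda s: (-s.last_committed, s.name))


reports = [
    MemberState("db-0", "c1", 10),
    MemberState("db-1", "c1", 42),
    MemberState("db-2", "c1", 42),
]
print(choose_bootstrap_member(reports).name)  # db-1
```

Note that the winner here is db-1, not ordinal 0, which is the out-of-order start the earlier comments describe.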

Stateful Sets were designed to standardize two hard problems: being able to identify all the valid members (pods identified by number that are running at most once on any node) and give admins the button to decide a member was never coming back (the force delete pod / force delete pv action). That was true black magic before - everyone did it their own way. So we worked with the ecosystem to enable vendors/communities/individuals to map those primitives into specific solutions, but did it somewhat deliberately as “we have to build this together”.

What I think the gap has been is that there is significant friction between the three realms of expertise - knowing what kube is providing, knowing how to map that to a specific problem like translating the Galera runbook into operator/script logic, and then communicating that to the teams that will be accountable for reacting. Vendors have incentives to make you pay for that expertise (or may not have it), large organizations hire people to provide it (most large db on kube deployments are at large tech companies), and in between you have a lot of uncertainty, knowledge gaps, and expertise that doesn’t necessarily transfer, and that is what drives “Kube isn’t great for stateful”.

It’s ironic to me because StatefulSets were intended to take advantage of those incentives to help the ecosystem scale, and the result is “worse is better” in that many more people can do state on HA DBs than were successful on VMs or metal, but it doesn’t mean they’re completely successful, and when people hit the rough edges it hurts. We can do better (that’s partially my day job), but there is a lot of pain people have taken so far that was probably unnecessary. You should use managed DBs if you can - and when you can’t, Kube should be the best alternative it can be (which isn’t far off - many DB SaaS offerings use some kube), and that’s what we need to focus on.