HN Mirror

by nonameiguess 510 days ago

> We assume that most applications handle upgrades inside their applications and not on an infrastructure level. Do you have an exact upgrade case in mind?

Part of my job is handling application installs and upgrades for military customers and contractors to the military who don't have sufficient internal expertise to do it themselves, or ideally creating something like your platform, but unfortunately often bespoke to their specific needs.

A few services that have become ubiquitous for orgs that have to self-host everything are things like Keycloak, Gitlab, the Harbor container registry or something like Artifactory or Nexus if they need to host package types other than just container images.

These aren't my own applications. They're third party. So it'd be great if I could make them single-click, but the reality of self-hosted web services is they tend to provide far too many options for the end-user for that to be possible. For instance, they all require some kind of backing datastore and usually a cache. Most of the time that's Postgres and Redis, but whether you also self-host those or use a managed cloud service, and if you self-host, whether you try to do that also via Kubernetes using Operators and StatefulSets, or whether you run those on dedicated VMs, dictates differences in how you upgrade them in concert with the application using them.

There are also stupid peculiarities you discover along the way. Keycloak has an issue when deployed as multi-instance: the Infinispan cache it runs on its own local volumes, which I guess is the standard Java Quarkus cache, may be invalidated if you try to replace instances in-place, so the standard Kubernetes rolling update may not work. Most of the time it will, but when it doesn't, it can bite you. They don't document this anywhere, but buried in a GitHub issue comment from a Keycloak developer two years ago I found a statement saying they recommend you scale to 0, then upgrade, then scale back to however many replicas you were running before.
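The scale-to-zero sequence described above can be sketched as a small Python helper that only builds the command list and leaves running it to the operator. This is a rough sketch under assumptions, not Keycloak's documented procedure: the release name, chart reference, namespace, and the idea that Keycloak runs as a StatefulSet called `keycloak` are all placeholders.

```python
from typing import List


def scale_to_zero_upgrade_plan(
    release: str,
    chart: str,
    workload: str,
    namespace: str,
    replicas: int,
) -> List[List[str]]:
    """Return the commands, in order, for a stop / upgrade / restart cycle.

    `workload` is a kubectl resource reference such as "statefulset/keycloak".
    Every argument value is a placeholder supplied by the caller.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    kubectl = ["kubectl", "--namespace", namespace]
    return [
        # 1. Stop every instance so no old pod shares a cache with a new one.
        kubectl + ["scale", workload, "--replicas=0"],
        # 2. Upgrade the release while nothing is running.
        ["helm", "upgrade", release, chart, "--namespace", namespace],
        # 3. Restore the replica count that was in place before the upgrade.
        kubectl + ["scale", workload, f"--replicas={replicas}"],
        # 4. Block until the new pods report ready.
        kubectl + ["rollout", "status", workload],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    plan = scale_to_zero_upgrade_plan(
        release="keycloak",
        chart="example-repo/keycloak",
        workload="statefulset/keycloak",
        namespace="identity",
        replicas=3,
    )
    for command in plan:
        print(" ".join(command))
```

Two caveats: `kubectl scale` returns before the old pods have finished terminating, so a real script would want to wait for them to be gone before step 2, and depending on the chart, `helm upgrade` may already reset the replica count on its own, which makes step 3 a no-op.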

The stupidest example so far is a customer who wanted to self-host Overleaf, a web-based editor for collaborative LaTeX documents. Overleaf itself only provides a docker compose toolkit, but it's not as simple as running "docker compose up." You run an installer and upgrader script instead that injects a bunch of discovered values from the environment into templates and then runs docker compose up. Basically, they just recreated Helm but using docker compose. Well, this customer is running everything else in Kubernetes. At one point, they had a separate VM for Overleaf but wanted to consolidate, so I wrote them a Helm chart that comes as close as possible to mimicking what the docker compose toolkit does, but I'm not the developer of Overleaf and I can't possibly know all the peculiarities of their application. They use MongoDB and the AWS equivalent managed service last I remember was not available in us-gov-east-1 or possibly not available in either Gov Cloud region, so we needed an internal Mongo self-hosted. I'm not remotely qualified to do that but tried my best and it mostly works, except every time we upgrade and cycle the cluster nodes themselves and Mongo's data volume has to migrate from one VM to another, the db won't come back up on restart except with some extra steps. I scripted these, but it still results in yet one more command you need to run besides just "helm upgrade."

Gitlab is its own beast. If you run it the way they recommend, you install Gitaly as an external service that runs on a VM, as they don't recommend having a Git server actually in Kubernetes because memory management in the Kubelet and the Linux kernel don't work well with each other for whatever reason. We had no problem running it in-cluster for years, until the customer's users started using Gitlab for hosting test data using LFS and pushed enormous commits, which sometimes brought down the whole server, until we migrated to an external Gitaly that runs on a VM.

But that means upgrading Gitlab now requires upgrading Gitaly separately, totally outside of a container orchestration platform. Also, Gitlab's provided Helm chart doesn't allow you to set the full securityContext on pods for whatever reason. It ignores everything in the context except user and group. So when you run with the restricted PSA configuration, as every military customer is going to do, you can't do a real Helm install. You need to render the templates, patch all of the security contexts to be in compliance, then apply the result. Ideally a Helm post-renderer would do the patching, since that's what it's for, but I could never get it to work and instead end up running helm template and kustomize as separate steps.
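The patching step between rendering and applying can be illustrated roughly like this. It is a minimal sketch in Python, not the kustomize setup described above and not anything Gitlab ships: it assumes the rendered manifests have already been parsed into plain dicts, and the helper names (`pod_spec_of`, `harden`) are hypothetical. The fields it sets are the container-level ones the restricted Pod Security Standard checks; note that `runAsNonRoot: true` only passes at runtime if the image really does run as a non-root user.

```python
import copy
from typing import Any, Dict, Optional

Manifest = Dict[str, Any]

# Container-level settings checked by the "restricted" Pod Security Standard.
RESTRICTED_CONTAINER_CONTEXT: Dict[str, Any] = {
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "runAsNonRoot": True,
    "seccompProfile": {"type": "RuntimeDefault"},
}

# Workload kinds that keep their pod spec under spec.template.spec.
TEMPLATED_KINDS = {"Deployment", "StatefulSet", "DaemonSet", "Job"}


def pod_spec_of(manifest: Manifest) -> Optional[Dict[str, Any]]:
    """Return the pod spec inside a manifest, or None if it has none."""
    kind = manifest.get("kind")
    if kind == "Pod":
        return manifest.get("spec")
    if kind in TEMPLATED_KINDS:
        return manifest.get("spec", {}).get("template", {}).get("spec")
    return None


def harden(manifest: Manifest) -> Manifest:
    """Return a copy whose containers carry the restricted settings.

    Existing keys such as runAsUser and runAsGroup are preserved; the four
    restricted keys are overwritten, which also discards any added capability.
    """
    patched = copy.deepcopy(manifest)
    spec = pod_spec_of(patched)
    if spec is None:
        return patched
    for field in ("containers", "initContainers"):
        for container in spec.get(field, []):
            context = container.setdefault("securityContext", {})
            context.update(copy.deepcopy(RESTRICTED_CONTAINER_CONTEXT))
    return patched
```

In a pipeline this would sit between `helm template` and `kubectl apply`, with a YAML library handling the parsing and serialization on either side.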

It's hellacious, but honestly I can't think of a single application ever that was as simple as it should have been to upgrade, which is why companies and government orgs end up in the ridiculous situation of having to hire an external consultant just to figure out how to upgrade stuff for them, because they wouldn't otherwise understand the failure modes and how to work around them.

It'd be nice if the applications themselves could just provide an easy button like you seem to be trying to offer, but the reality is once they have even just an external datastore, and they allow self-hosting, they can't do that, because they need to support customers running that in the same cluster using sub-charts, in the same cluster but using separate charts, using a cloud platform's managed service, self-hosted but on its own VM or bare metal server, possibly some combination of both cloud and on-prem or multi-cloud. A single easy button can't handle all of those cases. Generally, this is probably why so many companies try to make you use their managed services only and don't want you to self-host, because supporting that very quickly becomes a problem of combinatorial explosion.
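To put a rough number on that explosion, here is a small Python sketch that enumerates deployment shapes. It assumes only the four placement options listed above and treats each backing service as independently placeable; the service names passed in are just examples, and real products add more dimensions (versions, HA modes, mixed cloud and on-prem) on top.

```python
from itertools import product
from typing import Dict, List

# The four placement options named in the paragraph above.
PLACEMENTS = [
    "sub-chart in the same cluster",
    "separate chart in the same cluster",
    "managed cloud service",
    "self-hosted on a VM or bare metal",
]


def deployment_shapes(services: List[str]) -> List[Dict[str, str]]:
    """List every way to assign one placement to each backing service."""
    return [
        dict(zip(services, combo))
        for combo in product(PLACEMENTS, repeat=len(services))
    ]


if __name__ == "__main__":
    # A datastore plus a cache already gives 4 * 4 = 16 shapes to support.
    print(len(deployment_shapes(["postgres", "redis"])))
    # One more backing service multiplies that by four again: 64.
    print(len(deployment_shapes(["postgres", "redis", "object-storage"])))
```

Each extra backing service multiplies the count by the number of placements, so the upgrade paths an "easy button" would have to cover grow exponentially with the number of dependencies.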

1 comment

pmig 510 days ago

Interestingly, this is the problem we initially tried to solve. We built a Kubernetes Operator (back in the day in Kotlin) that does the full life cycle management of these kinds of apps, including management of CRs for database upgrades etc.

This project still runs in production at customers, but ultimately, as you pointed out, individual configurations became too big of a problem, as it was not feasible to build new operator versions for every configuration deviation. It was a fun project though: https://github.com/glasskube/operator
