by dpryden 2124 days ago
I'm confused about how the documentation recommends using a Kubernetes operator to manage OS updates. That seems weird and backwards to me. I would rather see an immutable OS AMI in an auto-scaled group, and just replace the node instance whenever there is an update.

I can see a place for managing OS updates on an instance, but that seems more like "pets" than "cattle"... and I've always treated Kubernetes nodes like cattle, not pets. Isn't that the most common approach anyway?

7 comments

We use 5000+ CoreOS nodes in production and never want to go back to replacing VMs with new images for each update. In-place immutable updates are faster and more efficient. Unlike RPM-based OSes, which are hard to patch, transactional updates provide a safe way to update in place instead of wastefully replacing full VMs for small OS updates.
As a former CoreOS (now Red Hat) employee, I'm glad to hear of your success with the product and it's awesome to see that scale giving you time back in your day to concentrate on the big picture.
Not at all happy with things right now. You guys abruptly killed CoreOS, which was great, and left us to fend for ourselves. Fedora CoreOS doesn’t cut it and we don’t want OpenShift. We will need to move to either Bottlerocket or Flatcar. A team member evaluated both and we are leaning towards Bottlerocket, since Flatcar has a higher risk of dying by running out of funding during COVID (seen this movie before with CoreOS). Amazon’s customer support team is also a lot better.
What doesn't cut it in Fedora CoreOS? (Just curious, as a Fedora contributor.) Also, what could be improved?
it's too late to improve.

    1. update wasn't possible - that basically means that customers can't count on you
    2. some decisions favored OpenShift instead of being general purpose
    3. you rebuilt the whole stack?! from scratch?!
    4. the documentation is twice as bad as from CoreOS (which was already pretty bad)
    5. the market now has more to offer, with way better solutions. Fedora CoreOS is basically just a rebuilt CoreOS that brings nothing new to the table and has shaken up all customers. If I need to reinstall my nodes anyway, I can evaluate k3OS, Bottlerocket, openSUSE MicroOS, Rancher, whatever, none of which slept while you guys rebuilt your stuff.
these are probably the worst things
Have you guys taken a look at Talos?
I agree with you. This seems more complex than just having an auto scaling group that automatically rotates nodes after a certain amount of time and picks up the new update when the node launches.
I can provide a little background on this. In general, yes, I would recommend that you just use an ASG and roll out a new AMI. However, that approach can be very expensive and time-consuming at truly massive scale (thousands or even tens of thousands of machines).

Bottlerocket is built in part based on our experiences operating AWS Fargate, which obviously needs to be able to patch a colossal number of hosts that are running people's containers, without downtime or disruption to those containers. Bottlerocket is designed to ensure that this is both efficient and safe. We aren't the only ones with this need. Many large orgs also have tremendous fleets, and it's unacceptable to cause significant disruption by rotating at the host level.

Another aspect to consider is stateful workloads that are using the local disks. Bottlerocket lets you safely update your host if you are running something like a database or other stateful system where you don't really want to move your data around.

Not everyone will need to use this updating mechanism, but I think it will be very attractive to many of the larger organizations with a lot of infrastructure.

A little confused, you say you don't want to disrupt the containers, but https://github.com/bottlerocket-os/bottlerocket#updates seems to indicate you still have to reboot for the update to take hold?
I agree with the confusion. There is the ability to roll out updates in a "wave"[1], but I'm not sure how this is better than a simple rollout strategy in Kubernetes, since a reboot of the node seems inevitable.

[1] https://github.com/bottlerocket-os/bottlerocket/tree/develop...
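
For anyone wondering what a "wave" rollout looks like in general, here is a minimal sketch. To be clear, this is not Bottlerocket's actual mechanism; the function names and the hash-based bucketing are assumptions made up for illustration. The idea is only that each host lands in a stable bucket, and buckets are updated one after another so that a bad update is caught before it reaches the whole fleet.

```python
import hashlib


def wave_for_host(host_id: str, num_waves: int) -> int:
    """Map a host to a wave index in a stable, evenly spread way.

    Hypothetical helper: hashes the host ID so the same host always
    lands in the same wave, regardless of fleet ordering.
    """
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_waves


def plan_rollout(host_ids, num_waves):
    """Group hosts into waves; wave 0 updates first, then wave 1, and so on."""
    waves = [[] for _ in range(num_waves)]
    for host_id in host_ids:
        waves[wave_for_host(host_id, num_waves)].append(host_id)
    return waves


if __name__ == "__main__":
    fleet = [f"node-{i}" for i in range(10)]
    for index, wave in enumerate(plan_rollout(fleet, 3)):
        print(f"wave {index}: {wave}")
```

Each host still reboots once, so this does not answer the reboot question. It only limits how much of the fleet is exposed to a given update at any one time.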

It seems to me, not to be combative, that if Fargate can't afford the "noschedule: node is old" overhead and customers of Fargate can't handle their containers restarting on a regular basis, there's something wrong with your management engine or with their design and implementation. Much of the point of containerization is that you can roll containers often and run enough of them that you never have a single point of failure. What part of that assumption is broken that destroying machines regularly doesn't work?
There are any number of reasons to avoid restarting things. Some customers are running code that has a cold start and needs some time to warm up its cache if it restarts. Some customers are running jobs (video rendering, machine learning training, etc) that might take literally days to complete. Interrupting these jobs and causing them to restart wastes the customer time and causes them to lose progress. Other containers may be hosting multiplayer game servers, and forcing them to restart would cause all people logged into the game instance to get disconnected or otherwise dropped from their game.

All of the above are use cases that AWS Fargate is used for. Beyond this, many folks simply don't like it when things happen unexpectedly outside of their control. We have Fargate Spot for workloads that can tolerate interruption, and we discount the price if you choose this launch strategy. However, Fargate on-demand seeks to avoid interrupting your containers. You are in control of when your containers start and stop or autoscale.

This makes a ton of sense and I appreciate the response. I think what people aren't recognizing is that cloud services make you pay for performance, so doing things like relaunching containers which have slow warmup time literally costs extra money. While it's certainly important to design systems such that the containers can be tossed aside easily, that doesn't mean there isn't value in reducing how often that tossing aside occurs.
Forgive me my hijack

Any plans to reduce the minimum bill time for Fargate to accommodate short tasks?

With 1 minute minimum billing you have to turn to lambda for very short tasks or have a long running Fargate consuming tasks from some message bus.

If you choose lambda, your containers don’t work so you need to rebuild your runtime with lambda layers or ebs or squeeze into the lambda env.

If you choose messaging, say SQS from a lambda called by API gateway you’ve complicated your architecture and your Fargate instance is potentially hanging out billing, idle, and waiting for messages.

Fargate spot removed the last reason to consider AWS Batch. Short tasks could largely replace lambda.

It would be nice to Fargate all the things.

This stuff is probably waaaay over my head, but isn't that what SIGTERM was made for? To notify a running process that the host needs to be shut down/restarted, to let the running process finish its current task (current frame encoding / current multiplayer game / current request / ...), and to signal that the state / cache / progress / ... needs to be saved.

The process on the AWS side would then be: send SIGTERM to all workloads. Wait for a [configurable] amount of time (maxed at xx hours) or until all workloads have exited (whichever comes first). Shut down the node. Update the node. Start the node. Restart the workloads.
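
On the workload side, the SIGTERM handling described above could be sketched roughly like this. It is a minimal illustration, not anything AWS ships: `run_worker`, `install_handler`, the `tasks` iterable, and the `checkpoint` callback are all hypothetical names standing in for real workload logic.

```python
import signal
import threading

# Set once the host has asked this workload to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the main loop decides when it is safe to exit."""
    shutdown_requested.set()


def install_handler():
    """Register the SIGTERM handler (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(tasks, checkpoint):
    """Run tasks one at a time, stopping cleanly between tasks after SIGTERM.

    `tasks` is an iterable of callables (one frame, one request, ...).
    `checkpoint` is a callable that persists how far we got.
    """
    completed = 0
    for task in tasks:
        if shutdown_requested.is_set():
            break  # do not start new work once shutdown was requested
        task()  # the unit of work in progress is always allowed to finish
        completed += 1
    checkpoint(completed)  # save progress before exiting
    return completed
```

A real service would call `install_handler()` once at startup and then `run_worker(...)`. How long the host waits before giving up and killing the process is a separate policy decision on the platform side.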

Yep you are right about SIGTERM, but let's think back to the original reason why we wanted to update the node: because of a patch, probably a security patch for a CVE?

What is the better option here? Implement a SIGTERM based process that allows the user to block the patch for a critical, possibly zero-day CVE for xx hours, remaining in a vulnerable state the entire time? Or implement a system that just patches the underlying host without interrupting the workloads on the box?

You aren't wrong; what you described is a possibility, but it is not the best one.

Nothing is "broken" about it. It's just that when you have tens of thousands of machines that might need an urgent security update, it's very inefficient and costly to destroy all of them at once instead of patching. Destroying machines regularly is not the same thing as frequently destroying all of them at once.
Constantly rotating nodes in and out of the cluster and restarting/relocating pods, even if mostly automated, causes a lot of needless infrastructure strain. It is IMO one of the most overlooked parts of Kubernetes, and I wish there was a better solution to maintain stable, long-running processes when needed.
If your pods need to be long running, you can annotate them as such and they will not be autoscaled.

https://github.com/kubernetes/autoscaler/blob/master/cluster...
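
A minimal sketch of what that could look like, assuming the annotation meant here is the Cluster Autoscaler's `cluster-autoscaler.kubernetes.io/safe-to-evict` (the link above is truncated, so that is a guess). The pod name and image are placeholders. Strictly speaking, the annotation stops the autoscaler from evicting the pod when it wants to scale a node down; it does not affect other kinds of scaling.

```python
import json

# Annotation key read by the Kubernetes Cluster Autoscaler (assumed to be
# the one the truncated link refers to).
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def long_running_pod(name, image):
    """Build a Pod manifest that asks the autoscaler not to evict this pod."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Annotation values must be strings, hence "false" and not False.
            "annotations": {SAFE_TO_EVICT: "false"},
        },
        "spec": {"containers": [{"name": name, "image": image}]},
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = long_running_pod("render-job", "example.com/render:1.0")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be fed to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.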

Will they then also not be updated?
Not automatically. The autoscaler and other tooling will wait for the Pod to complete execution, or you can manually force the upgrade.
This is how OpenShift 4 does things. I too thought it was strange at first but now with some experience it's quite pleasant.

Can be a beast to debug though if you haven't done it before.

Aside from being faster than replacing all of the hosts, the reason OpenShift does it this way is that you can't just burn down and replace a fleet of bare metal machines. While re-PXEing is possible, this takes a ton of time and stresses that infrastructure.

Doing the same on cloud, metal, OpenStack, VMware, etc means that your cluster's operational experience remains the same and in most cases is less disruptive.

edit: having your nodes controlled by your cluster has a number of other benefits aside from patching, like the Node Tuning Operator that can tweak settings based on the types of workloads running on that set of machines.

I can assure you that OpenShift doesn't take this path because it is "better". It does so because bare-metal is a significant part of their market and there isn't a better option to automate the process currently.

I once worked on a competing product (before the OS update operator was available) and the update-in-place model was always a disaster. Various problems like DNS, service discovery, timeouts, breaking changes to dependency pkgs, etc. make for a problematic process. Combine that with the frantic pace of k8s development, the short node compatibility window (2-3 minor k8s releases), and various CVEs, and you end up debugging a lot of machines in unknown states that fail to rejoin clusters after reboots.

This has definitely not been my experience running many hundreds of Red Hat CoreOS nodes in production.

So far, aside from a few small flaky issues, having the cluster nodes _and_ the OpenShift cluster update in lock step has been dramatically simpler to manage.

Was that with a 1990s mutable package-based distro by any chance?
Why? I view being able to do this as a huge advantage. I don't want to lose instance store state just for an OS update, and I love having the Kubernetes operator interface to do this. The Kubernetes operator also operates at a cluster level, which means we don't need to write scripts to churn ASGs. It is eyebrow-raising that they have only enabled this for Kubernetes and not ECS. I suspect that this is one of many signs that the inferior and lock-in-prone ECS service will be deprecated soon.
> I would rather see an immutable OS AMI in an auto-scaled group, and just replace the node instance whenever there is an update.

It sounds like you're trying hard to reinvent Kubernetes while doing your best to avoid mentioning Kubernetes features like Kubernetes operators.

your description made me think of...

https://www.merriam-webster.com/dictionary/fungible