by markstos 1439 days ago
This was even worse than the headline made it sound.

If you had `unattended-upgrades` running and had the "automatic reboot" option enabled, then all your Ubuntu 20.04 servers running Docker would reboot themselves and not come back up.

First, the bug was in a security branch. Second, it wasn't just the containers that crashed. If you started containers at boot via Docker, then the host OS kernel panicked and crashed at boot, since the containers share the kernel with the host.

At that point, you can't SSH in and have to follow the procedure for restoring from backup or re-mounting the root volume on an alternate host to revert the kernel version being run.

And then of course if you revert the kernel upgrade, you were once again vulnerable to whatever problem the security update was fixing...


Sounds about right. And it's not the first time this has happened either. I recall getting a few of those instant unit 3 panics over the past few years with Ubuntu, often with things that are less common in production, like tc (which in our case we were using in production to work around conntrack race conditions). Sometimes we also got non-panicking but absolutely production- and nerve-wrecking issues, like TCP window size calculation overflows after the window went to zero due to a temporary slow consumer, freezing the window size at only a few bytes instead of promptly recovering the full window.

Not to mention we’ve also had our fair share of production triple faults from bugs in the Intel firmware patches for Spectre, which took weeks to investigate & fix between ourselves struggling to keep our exchange up & running, Intel, and AWS.

And that is why there’s value in the CoreOS/ContainerLinux-like solutions we designed & implemented nearly a decade ago now. Being able to promptly roll back any kernel/system/package upgrade at once, either manually or automatically after a few panics in quick succession have been detected, is actually quite awesome. Not to mention the slow update rollout strategy baked into the Omaha controller.
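
To make the "roll back after a few panics in quick succession" idea concrete, here is a minimal sketch of the decision logic only. It is not how CoreOS/Container Linux implements it; the function name, the threshold of 3 failures, and the 10-minute window are all hypothetical values chosen for illustration.

```python
from typing import Sequence


def should_roll_back(failed_boot_times: Sequence[float],
                     max_failures: int = 3,
                     window_seconds: float = 600.0) -> bool:
    """Return True if max_failures failed boots fall within window_seconds.

    failed_boot_times holds timestamps (in seconds) of boots that ended in
    a panic. The defaults are assumptions for this sketch, not real values.
    """
    times = sorted(failed_boot_times)
    # Check every run of max_failures consecutive failures.
    for i in range(len(times) - max_failures + 1):
        if times[i + max_failures - 1] - times[i] <= window_seconds:
            return True
    return False
```

A boot-time agent could call something like this and, on True, switch back to the previously known-good system image instead of retrying the new one.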

But the reality is that the what-ifs are always the hardest to market, nearly always after-thoughts and with fast-spiking/fast-decaying traction after major events.

It really seems like there’s no good non-redhat (but still “production capable”) alternative to CoreOS nowadays, right? It’s pretty much Fedora / Redhat CoreOS or go directly to things such as k3os?
The rancher stack is pretty amazing.

Elemental is pretty close to coreos: https://github.com/rancher/elemental/

They even have a way to build arbitrary os images: https://github.com/rancher/elemental-toolkit

It's pretty great

k3os is in a dying limbo; now is the time to get some interest in using stuff like it
I know it’s too late for a bunch of shops, but for god's sake please don’t use unattended upgrades to do your patching unless you want to hate your life and chase down hard-to-find, hard-to-undo bugs.

Build your images in a CI job and have your deploy version be (code version, image version), so patching runs through all the same tests your code does and you have a trivial roll-forward to undo any mess you find yourself in.
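
As a rough sketch of what a (code version, image version) deploy version could look like, assuming nothing about the commenter's actual tooling: the class, the function names, and the version strings in the test are hypothetical.

```python
from typing import NamedTuple, Sequence


class DeployVersion(NamedTuple):
    code_version: str   # e.g. a commit hash or release tag (assumed format)
    image_version: str  # e.g. the tag of the OS image built in CI (assumed)


def patch_os(current: DeployVersion, new_image: str) -> DeployVersion:
    """An OS patch becomes a new deploy: same code, new image."""
    return current._replace(image_version=new_image)


def last_known_good(history: Sequence[DeployVersion],
                    bad: DeployVersion) -> DeployVersion:
    """Return the most recent deploy in history that is not the bad one."""
    for version in reversed(history):
        if version != bad:
            return version
    raise LookupError("no known-good deploy version in history")
```

Because the image version is part of the deploy version, undoing a bad kernel or package update is the same operation as undoing a bad code change: deploy the pair that last passed.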

> don’t use unattended upgrades

> Build your images in CI job

I know container images should generally be immutable, but I would expect unattended upgrades to be mostly used on the host, not in a container, in which that management system doesn't really work (unless you're doing VMs where you can deploy immutable root images to the VMs as well, or some fun bare metal + PXE combination).

Alternatively, I suppose, depending on the size of your operation, you should consider having a dummy prod using at least one of each of the servers in your environment and using that to validate host upgrades. After that you can push an unattended upgrade via a self-hosted package+upgrade server.

Let things be automatic to the maximum degree possible but give yourself a single hard human checkpoint and some minimum level of validation in a dummy environment first.

The idea is that your deploy step should handle both deploying code and upgrading the OS, so all changes go through the same pipeline.
> or some fun bare metal + PXE combination

This is actually what I implemented for our hypervisor tier, it’s not as scary as it sounds. I could legit completely rebuild our entire stack down to the metal in about 3 hours.

Kick off a new hypervisor version, the inactive side PXE boots all the nodes, installs and configures a Proxmox cluster, slaves itself to our Ceph cluster, and then either does a hot migration of all the VMs or kicks off a full deploy which rebuilds all the infra (Consul, Rabbit, Redis, LDAP, Elastic, PowerDNS, etc) along with the app servers. The hardest part (which really isn’t) is maintaining the clusters across the blue/green sides.

With this setup our only mutable infrastructure was our Ceph cluster (because replacing OSDs takes unacceptably long) and our DB (for performance the writers lived on dedicated servers, the read replicas lived on the VMs.).

Sorry, not my experience.

My experience has been that by the time I notice some serious vulnerability is in the news, my servers have already patched themselves. I have never "hated life" or had a "hard to find and undo bug" due to automatic security patching. I pretty quickly found what caused this and had a clear path to resolution.

This is the first security update that caused a boot failure in about a decade. It was bad, but it didn't change my mind about unattended-upgrades. My takeaway is that maybe I should have upgraded my 20.04 servers to 22.04 sooner.

You’re conflating unattended-upgrades (server mutability, hard to roll back) with automated patching in general. Do automated patching, but also run the changes through your CI so you can catch breaking changes and roll them out in a way that’s easy to debug (you can diff images) and revert.

I bet when you update your software dependencies you run those changes through your tests, but your OS is a giant pile of code that usually gets updated differently and independently, for mostly historical reasons.

> I bet when you update your software dependencies you run those changes through your tests, but your OS is a giant pile of code that usually gets updated differently and independently, for mostly historical reasons.

Close. We are moving towards defining our server states through Ansible, but the project is not close to completion. Perhaps once that's further along, we could use Ansible Molecule + CI to test a new server state when there's a new patch available, but that's not an option on the table today.

The system we had in place for /today/ worked: Lower priority or redundant servers were set to auto-reboot after applying security updates, while other critical servers require manual reboot at low-risk times. By then, the patch has already been tested on lower-risk servers.

As a result, this issue caused no user-visible downtime for us, and due to the staggered runs of unattended-upgrades affected a minimal number of servers.
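
The tiering described above could be sketched roughly like this. It is an illustration under assumptions, not the poster's actual setup: the `Server` type, the function names, and the rule "hold critical servers until every low-risk server is healthy" are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Server:
    name: str
    critical: bool  # critical servers are only rebooted manually


def auto_reboot_group(servers: Iterable[Server]) -> List[str]:
    """Low-priority or redundant servers that may reboot on their own."""
    return [s.name for s in servers if not s.critical]


def manual_reboot_candidates(servers: Iterable[Server],
                             healthy_after_patch: Set[str]) -> List[str]:
    """Critical servers become candidates for a manual reboot only after
    every low-risk server has come back healthy on the new patch."""
    servers = list(servers)
    low_risk = auto_reboot_group(servers)
    if not low_risk or not all(n in healthy_after_patch for n in low_risk):
        return []
    return [s.name for s in servers if s.critical]
```

With this shape, a patch that breaks boot stops at the low-risk tier, which matches the outcome described: no user-visible downtime.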

And this was the first time in 10+ years that something like this happened, and we have to prioritize how we spend our process-improvement time based on likelihood and impact.

> I know it’s too late for a bunch of shops, but for god's sake please don’t use unattended upgrades to do your patching unless you want to hate your life and chase down hard-to-find, hard-to-undo bugs

Some years ago everyone said the same about windows-servers ;)

> have to follow the procedure for restoring from backup or re-mounting the root volume on an alternate host to revert the kernel version being run.

Or add `systemd.mask=docker.service` to your boot parameters to prevent Docker from starting.

Except that, if your server is stuck in an infinite "boot -> docker starting -> container starting -> crashing kernel -> reboot" loop, you won't ever get a chance to actually add anything to your boot parameters.
If you have access to the console (local physical machine, VM on a system that can expose the console, physical box that you have console access to via IPMI or other means), can you not specify that directive to be passed through via grub's interactive menu?

Failing that you could try the “single” directive and poke other configurations once booted in that mode.

A faff to be sure, but hopefully viable options (assuming the interactive menu hasn't been disabled to save a few seconds off boot time!).

Absolutely can, I'm quite surprised at the 'what do' attitude around this. It's routine -- not in all organizations to be sure, but it's a solved problem.

There are options even without out of band management. You can choose to configure your systems with PXE -- if the installation ever fails, it can boot into a recovery environment over the network.

That's not correct. If you stop the boot you can add 'single' to the boot statement which will drop you in a single user shell from where you can do quite a bit of maintenance.
AWS at least provides serial console access, so you have the option to access it during the boot cycle.

Alternatively, you umount the drive, attach it to another machine, chroot into it, fix grub or whatever, reverse the process and boot again. It's a few steps, but can be done in a few minutes with practice.

Out of band management is common and highly recommended
Actually networking and ssh come up for a couple of seconds before containerd triggers the kernel panic so you can fix it by doing this:

while true; do ssh <servername> sudo mv /usr/bin/containerd /usr/bin/containerd.backup ; sleep 1; done

While rebooting the system

Sorry, what? You don't lose complete control of a server because it's rebooting nonstop.
Wouldn't rollback of kernel be a choice in grub menu?

It's pretty standard for all distros to have that choice.

That usually requires physical access to the server to select it during boot.
If you have unattended-upgrade and automatic reboot in the cloud to benefit from security updates for long-lived instances, then you had better make sure to have a tty console attached to it. You are treating it like a physical machine, so you must have the same tooling around.
Not really, console access through IPMI is found on most servers.

Exceptions tend to be white boxes built with desktop components, at which point, yea. The proverbial You asked for this problem

Not necessarily. With good timing and some luck, you can connect the serial/"recovery" console before GRUB's timeout ends and either change the running kernel or add the `systemd.mask=docker.service` boot parameter to prevent Docker from starting.
Sounds like a VM and not a physical server.
Nope. Back before VMs were a thing it was common to do "lights out" style remote management via a console server. That console server would then have a serial connection (the old 9 pin d-sub plug[1]) to your individual physical servers. You could then connect to your remote server's local TTY via the console server, a little like jumping to remote servers via an SSH bastion. However, it did sometimes require a little bit of prior configuration, depending on your distro[2].

This wasn't just limited to Linux either. It was a common UNIX trick :)

This is a bit of a lost art these days though. iLO and IPMI have replaced the need for serial. Then virtualisation and, to a lesser extent, containerisation have lowered the bar even further, while also moving the industry towards more ephemeral systems that can be destroyed and rebuilt automatically rather than the old habits of nursing failed hosts back to health.

[1] https://duckduckgo.com/?q=9+pin+d-sub+plug&t=newext&atb=v316...

[2] https://www.kernel.org/doc/html/v5.3/admin-guide/serial-cons... (a lot of distros at the time did ship a kernel with this support compiled in. I don't know how common it is now).

And quite a few implementations actually emulate the serial console allowing for the exact same access. (Serial Over Lan or SOL for short.)
Still common on network devices (Cisco, Juniper, Arista etc.). No IPMI or similar on those.

Console servers from the likes of OpenGear and Lantronix are still heavily used for those.

Sure. For a physical server, you'd use its lights-out management to the same effect.
If it's in the cloud you'd have a virtual console.
Unsurprisingly AWS is ghetto about this.
Or a real server with Lights Out Management.
That’s why “usually” is in the sentence. :)

Most smaller teams usually don’t prioritize physical access — they usually only need it for one-off events. While this would be a one-off event, it would be one that affects many servers.

I'd be more inclined to say that physical servers usually have some sort of console access available.

I'm not sure I've ever worked with any (2008-present) that don't in any case.

That is really not my experience at all. Every professional smaller team I worked with "usually" had this figured out and set up. In times of home office, no one wants to be at the office for just pressing a single button on some server.

Oh well, I guess experiences differ.

My ops experience is all pre-2012 and with teams numbering fewer than 3 for the whole org. So I’m sure things have changed or gotten cheaper? I can’t see a team of 3-4 having the budget to get something that allows them to be “lazy”, especially when that budget can go towards something useful. But I guess the pandemic probably changed things there?
You can do this kind of thing across the network if you have to.
no.

it requires access to the serial console or baseboard management controller or whatever terms have emerged.

have never rented a physical server w/o this.

> If you had `unattended-upgrades` running and had the "automatic reboot" option enabled, then all your Ubuntu 20.04 servers running Docker would reboot themselves and not come back up.

Isn't the common wisdom that you should have them enabled, but staggered across hours/days?

Not a huge Debian/Ubuntu user, but I think the systemd timer that triggers the unattended updates has a random delay added to it. I don't know if it's hours or just seconds.
I believe it's staggered across hours by default and it seems that Canonical might have been able to at least stop pushing out the bad update even before they had a fix
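
To illustrate why a per-host random delay staggers a fleet, here is a small simulation sketch. It does not read any real timer configuration; the function name is hypothetical and the maximum delay is a parameter because the actual default is not confirmed in this thread.

```python
import random
from typing import Dict, Iterable, Optional


def staggered_offsets(hosts: Iterable[str],
                      max_delay_seconds: float,
                      seed: Optional[int] = None) -> Dict[str, float]:
    """Give each host an independent random start offset in
    [0, max_delay_seconds], so upgrades do not all begin at once."""
    rng = random.Random(seed)  # seed only makes the sketch reproducible
    return {host: rng.uniform(0.0, max_delay_seconds) for host in hosts}
```

The wider the delay, the more time there is to notice a bad update on the first hosts and stop publishing it before the rest of the fleet picks it up.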
Probably better you have rolling A/B replacements that stop the replacement run if the replacement doesn't come up.
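
A minimal sketch of a rolling replacement that stops as soon as one replacement fails to come up. The `replace` and `is_healthy` callables are placeholders for whatever your tooling provides; they are assumptions of this example, not a real API.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_replace(hosts: Iterable[str],
                    replace: Callable[[str], None],
                    is_healthy: Callable[[str], bool],
                    ) -> Tuple[List[str], Optional[str], List[str]]:
    """Replace hosts one at a time and halt on the first unhealthy one.

    Returns (replaced_ok, failed_host_or_None, untouched_hosts).
    """
    replaced: List[str] = []
    remaining = list(hosts)
    while remaining:
        host = remaining.pop(0)
        replace(host)
        if not is_healthy(host):
            # Stop here: the remaining hosts stay on the old version.
            return replaced, host, remaining
        replaced.append(host)
    return replaced, None, remaining
```

With a bad kernel update, this would cost one host instead of the whole fleet, since the run halts before touching the hosts still listed as untouched.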

This is mostly an in-place upgrade issue?

And while I haven't had this happen to me yet; the fear of something like this or even worse is why I try to stay one step behind the update paths on linux distros.

Security patches matter, but I'm no one important, so I should be fine to wait a week or month...

Anyone else who is important though... servers for example...

That sounds downright horrible!