| HN Mirror


by markstos 1486 days ago

This was even worse than the headline made it sound.

If you had `unattended-upgrades` running and had the "automatic reboot" option enabled, then all your Ubuntu 20.04 servers running Docker would reboot themselves and not come back up.

First, the bug was in a security branch. Second, it wasn't just the containers that crashed. If you booted containers on boot via Docker, then the host OS kernel-panicked and crashed at boot, since the containers share the kernel with the host.

At that point, you can't SSH in and have to follow the procedure for restoring from backup or re-mounting the root volume on an alternate host to revert the kernel version being run.

And then of course if you revert the kernel upgrade, you were once again vulnerable to whatever problem the security update was fixing...

7 comments

QuentinM 1486 days ago

Sounds about right. And it's not the first time it has happened either. I recall getting a few of those instant unit 3 panics over the past few years with Ubuntu. Often with things not as common out there in production, like tc (which in our case we were using in production to work around conntrack race conditions), and sometimes we also got non-panicking but absolutely production/nerve-wracking issues like TCP window size calculation overflows after the window went to zero due to a temporary slow consumer, freezing the window size at only a few bytes instead of getting a prompt full window recovery.

Not to mention we’ve also had our fair share of production triple faults from bugs in the Intel firmware patches for Spectre, which took weeks to investigate & fix between ourselves struggling to keep our exchange up & running, Intel, and AWS.

And that is why there’s value in the CoreOS/ContainerLinux-like solutions we designed & implemented nearly a decade ago now. Being able to promptly roll back any kernel/system/package upgrades at once, either manually or automatically after a few panics are detected in quick succession, is actually quite awesome. Not to mention the slow update rollout strategy baked into the Omaha controller.

But the reality is that the what-ifs are always the hardest to market, nearly always after-thoughts and with fast-spiking/fast-decaying traction after major events.

stingraycharles 1486 days ago

It really seems like there’s no good non-redhat (but still “production capable”) alternative to CoreOS nowadays, right? It’s pretty much Fedora / Redhat CoreOS or go directly to things such as k3os?

georgyo 1486 days ago

The rancher stack is pretty amazing.

Elemental is pretty close to coreos: https://github.com/rancher/elemental/

They even have a way to build arbitrary os images: https://github.com/rancher/elemental-toolkit

It's pretty great

jcastro 1486 days ago

Try flatcar: https://www.flatcar.org/

Already__Taken 1486 days ago

k3os is in a dying limbo, now is the time to get some interest in using stuff like it

Spivak 1486 days ago

I know it’s too late for a bunch of shops but for god’s sake please don’t use unattended upgrades to do your patching unless you want to hate your life and chase down hard-to-find, hard-to-undo bugs.

Build your images in a CI job and have your deploy version be (code version, image version) so patching runs through all the same tests your code does and you have a trivial roll-forward to undo any mess you find yourself in.
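As a rough sketch of what a (code version, image version) pair could look like in practice: the tag format and the helper name `deploy_tag` below are made up for illustration and do not come from any particular tool.

```shell
#!/bin/sh
# Hypothetical helper: combine a code version and an OS image version into
# one deploy identifier, so a patched image is a new deploy like any other.
deploy_tag() {
    code_version=$1
    image_version=$2
    printf '%s+img.%s\n' "$code_version" "$image_version"
}

# Rolling forward after a bad OS patch keeps the code version and pairs it
# with a different image build. Both version strings here are placeholders.
deploy_tag "app-1.4.2" "build41"   # prints app-1.4.2+img.build41
deploy_tag "app-1.4.2" "build42"   # prints app-1.4.2+img.build42
```

Because the image build is part of the identifier, an OS patch shows up in the pipeline as a new deploy rather than as a silent change on a running host.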

yjftsjthsd-h 1486 days ago

> don’t use unattended upgrades

> Build your images in CI job

I know container images should generally be immutable, but I would expect unattended upgrades to be mostly used on the host, not in a container, in which that management system doesn't really work (unless you're doing VMs where you can deploy immutable root images to the VMs as well, or some fun bare metal + PXE combination).

CuriousCosmic 1486 days ago

alternatively I suppose depending on the size of your operation, you should consider having a dummy prod using at least one of each of the servers in your environment and using that to validate host upgrades. after that you can push an unattended upgrade via a self-hosted package+upgrade server.

Let things be automatic to the maximum degree possible but give yourself a single hard human checkpoint and some minimum level of validation in a dummy environment first.

ec109685 1486 days ago

Idea is that your deploy step should handle both deploying code as well as upgrading OS, so all changes go through same pipeline.

Spivak 1486 days ago

> or some fun bare metal + PXE combination

This is actually what I implemented for our hypervisor tier, it’s not as scary as it sounds. I could legit completely rebuild our entire stack down to the metal in about 3 hours.

Kick off a new hypervisor version, the inactive side PXE boots all the nodes, installs and configures a Proxmox cluster, slaves itself to our Ceph cluster, and then either does a hot migration of all the VMs or kicks off a full deploy which rebuilds all the infra (Consul, Rabbit, Redis, LDAP, Elastic, PowerDNS, etc) along with the app servers. The hardest part (which really isn’t) is maintaining the clusters across the blue/green sides.

With this setup our only mutable infrastructure was our Ceph cluster (because replacing OSDs takes unacceptably long) and our DB (for performance the writers lived on dedicated servers, the read replicas lived on the VMs.).

markstos 1486 days ago

Sorry, not my experience.

My experience has been that by the time I notice some serious vulnerability is in the news, my servers have already patched themselves. I have never "hated life" or had a "hard to find and undo bug" due to automatic security patching. I pretty quickly found what caused this and had a clear path to resolution.

This is the first security update that caused a boot failure in about a decade. It was bad, but it didn't change my mind about unattended-upgrades. My takeaway is that maybe I should have upgraded my 20.04 servers to 22.04 sooner.

Spivak 1486 days ago

You’re conflating unattended-upgrades (server mutability, hard to roll back) with automated patching in general. Do automated patching but also run the changes through your CI so you can catch breaking changes and roll them out in a way that’s easy to debug (you can diff images) and revert.

I bet when you update your software dependencies you run those changes through your tests but your OS is a giant pile of code that usually gets updated differently and independently because mostly historical reasons.
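To make "you can diff images" concrete, here is a minimal sketch that compares two package manifests. It assumes each image build saves a sorted "name version" list; the function name `changed_packages` and the package names in the sample are invented for illustration.

```shell
#!/bin/sh
# Print lines that appear only in the new manifest: packages added or changed.
# Manifests are sorted "name version" lines, which on a Debian/Ubuntu image
# could come from:
#   dpkg-query -W -f='${Package} ${Version}\n' | LC_ALL=C sort
changed_packages() {
    old=$1
    new=$2
    LC_ALL=C comm -13 "$old" "$new"
}

# Made-up sample manifests, already sorted.
old_manifest=$(mktemp)
new_manifest=$(mktemp)
printf 'kernel-example 1.0\nlibexample 2.0\n' > "$old_manifest"
printf 'kernel-example 1.1\nlibexample 2.0\n' > "$new_manifest"

changed_packages "$old_manifest" "$new_manifest"   # prints kernel-example 1.1
rm -f "$old_manifest" "$new_manifest"
```

When a new image fails in CI, a list like this narrows the search to the packages that actually changed between the two builds.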

markstos 1486 days ago

> I bet when you update your software dependencies you run those changes through your tests but your OS is a giant pile of code that usually gets updated differently and independently because mostly historical reasons.

Close. We are moving towards defining our server states through Ansible, but the project is not close to completion. Perhaps once that's further along, we could use Ansible Molecule + CI to test a new server state when there's a new patch available, but that's not an option on the table today.

The system we had in place for /today/ worked: Lower priority or redundant servers were set to auto-reboot after applying security updates, while other critical servers require manual reboot at low-risk times. By then, the patch has already been tested on lower-risk servers.
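Roughly, the split between tiers could be expressed like this. It is only a sketch: the tier names and the `reboot_policy` helper are invented here, while `Unattended-Upgrade::Automatic-Reboot` is the actual unattended-upgrades option.

```shell
#!/bin/sh
# Hypothetical helper: print an apt.conf.d fragment for a server tier.
# "canary" hosts reboot on their own; every other tier waits for a human.
reboot_policy() {
    case "$1" in
        canary) auto=true ;;
        *)      auto=false ;;
    esac
    printf 'Unattended-Upgrade::Automatic-Reboot "%s";\n' "$auto"
}

reboot_policy canary     # prints Unattended-Upgrade::Automatic-Reboot "true";
reboot_policy critical   # prints Unattended-Upgrade::Automatic-Reboot "false";
```

The output would go into a file under /etc/apt/apt.conf.d/ on each host (the exact file name is up to you), most likely written by whatever configuration management is already in place.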

As a result, this issue caused no user-visible downtime for us, and due to the staggered runs of unattended-upgrades affected a minimal number of servers.

And this was the first time in 10+ years that something like this happened, and we have to choose how to prioritize spending our process-improvement time based on likelihood and impact.

nix23 1486 days ago

>I know it’s too late for a bunch of shops but for god’s sake please don’t use unattended upgrades to do your patching unless you want to hate your life and chase down hard-to-find, hard-to-undo bugs

Some years ago everyone said the same about windows-servers ;)

akx 1486 days ago

> have to follow the procedure for restoring from backup or re-mounting the root volume on an alternate host to revert the kernel version being run.

Or add `systemd.mask=docker.service` to your boot parameters to prevent Docker from starting.

capableweb 1486 days ago

Which, if your server is stuck in an infinite "boot -> docker starting -> container starting -> crashing kernel -> reboot" loop, you won't ever get a chance of actually adding anything to your boot parameters.

dspillett 1486 days ago

If you have access to the console (local physical machine, VM on a system that can expose the console, physical box that you have console access to via IPMI or other means), can you not specify that directive to be passed through via grub's interactive menu?

Failing that you could try the “single” directive and poke other configurations once booted in that mode.

A faff to be sure, but hopefully viable options (assuming the interactive menu hasn't been disabled to save a few seconds off boot time!).

bravetraveler 1486 days ago

Absolutely can, I'm quite surprised at the 'what do' attitude around this. It's routine -- not in all organizations to be sure, but it's a solved problem.

There are options even without out of band management. You can choose to configure your systems with PXE -- if the installation ever fails, it can boot into a recovery environment over the network.

jacquesm 1486 days ago

That's not correct. If you stop the boot you can add 'single' to the boot statement which will drop you in a single user shell from where you can do quite a bit of maintenance.

markstos 1486 days ago

AWS at least provides serial console access so you have the option to access it during the boot cycle.

Alternatively, you umount the drive, attach it to another machine, chroot into it, fix grub or whatever, reverse the process and boot again. It's a few steps, but can be done in a few minutes with practice.
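The rough shape of that second route, as a dry-run sketch that only prints the commands. The device name and mount point are placeholders, and masking `docker.service` is just one possible fix inside the chroot (an assumption borrowed from elsewhere in this thread); reverting the kernel would be another.

```shell
#!/bin/sh
# Dry-run sketch: prints the rescue commands instead of running them.
# rescue_plan is a hypothetical helper; review and adapt before use.
rescue_plan() {
    dev=$1   # root partition of the broken server, as seen on the rescue host
    mnt=$2   # empty directory to mount it on
    echo "mount $dev $mnt"
    for fs in dev proc sys; do
        echo "mount --bind /$fs $mnt/$fs"
    done
    # Stop Docker from starting at boot so the host can come up again.
    echo "chroot $mnt systemctl mask docker.service"
    echo "umount -R $mnt"
}

rescue_plan /dev/xvdf1 /mnt/rescue
```

After that, detach the volume, reattach it to the original instance as its root device, and boot.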

bravetraveler 1486 days ago

Out of band management is common and highly recommended

mrintegrity 1485 days ago

Actually networking and ssh come up for a couple of seconds before containerd triggers the kernel panic so you can fix it by doing this:

while true; do ssh <servername> sudo mv /usr/bin/containerd /usr/bin/containerd.backup ; sleep 1; done

While rebooting the system

gtirloni 1486 days ago

Sorry, what? You don't lose complete control of a server because it's rebooting nonstop.

_y5hn 1486 days ago

Wouldn't rollback of kernel be a choice in grub menu?

It's pretty standard for all distros to have that choice.
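For reference, a small sketch that lists the boot entries a GRUB config offers. The helper name `list_menuentries` and the sample file are invented, and the kernel versions in it are placeholders rather than real releases.

```shell
#!/bin/sh
# Print the titles of all menuentry lines in a grub.cfg-style file.
list_menuentries() {
    sed -n "s/^[[:space:]]*menuentry '\([^']*\)'.*/\1/p" "$1"
}

# Made-up sample in the layout Ubuntu generates; on a real host the file
# would be /boot/grub/grub.cfg.
sample=$(mktemp)
cat > "$sample" <<'EOF'
menuentry 'Ubuntu' --class ubuntu {
submenu 'Advanced options for Ubuntu' {
	menuentry 'Ubuntu, with Linux 5.4.0-NEW-generic' --class ubuntu {
	menuentry 'Ubuntu, with Linux 5.4.0-OLD-generic' --class ubuntu {
EOF

list_menuentries "$sample"
rm -f "$sample"

# With GRUB_DEFAULT=saved, a one-time boot into an older entry can be
# requested with grub-reboot, giving it 'submenu title>entry title'.
```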

withinboredom 1486 days ago

That usually requires physical access to the server to select it during boot.

darkwater 1486 days ago

If you have unattended-upgrade and automatic reboot in the cloud to benefit from security updates for long-lived instances, then you better make sure to have a tty console attached to it. You are treating it like a physical machine, you must have the same tooling around.

bravetraveler 1486 days ago

Not really, console access through IPMI is found on most servers

Exceptions tend to be white boxes built with desktop components, at which point, yea. The proverbial You asked for this problem

akx 1486 days ago

Not necessarily. With good timing and some luck, you can connect the serial/"recovery" console before GRUB's timeout ends and either change the running kernel or add the `systemd.mask=docker.service` boot parameter to prevent Docker from starting.

withinboredom 1486 days ago

Sounds like a VM and not a physical server.

laumars 1486 days ago

Nope. Back before VMs were a thing it was common to do "lights out" style remote management via a console server. That console server would then have a serial connection (the old 9 pin d-sub plug[1]) to your individual physical servers. You could then connect to your remote server's local TTY via the console server, a little like jumping to remote servers via an SSH bastion. However it did sometimes require a little bit of prior configuration, depending on your distro[2].

This wasn't just limited to Linux either. It was a common UNIX trick :)

This is a bit of a lost art these days though. iLO and IPMI have replaced the need for serial. Then virtualisation and, to a lesser extent, containerisation have lowered the bar even further, while also moving the industry towards more ephemeral systems that can be destroyed and rebuilt automatically rather than the old habits of nursing failed hosts back to health.

[1] https://duckduckgo.com/?q=9+pin+d-sub+plug&t=newext&atb=v316...

[2] https://www.kernel.org/doc/html/v5.3/admin-guide/serial-cons... (a lot of distros at the time did ship a kernel with this support compiled in. I don't know how common it is now).

jacquesm 1486 days ago

And quite a few implementations actually emulate the serial console allowing for the exact same access. (Serial Over Lan or SOL for short.)

topranks 1486 days ago

Still common on network devices (Cisco, Juniper, Arista etc.). No IPMI or similar on those.

Console servers from the likes of OpenGear and Lantronix still heavily used for those.

akx 1486 days ago

Sure. For a physical server, you'd use its lights-out management to the same effect.

ape4 1486 days ago

If it's in the cloud you'd have a virtual console.

zurn 1485 days ago

Unsurprisingly AWS is ghetto about this.

taspeotis 1486 days ago

Or a real server with Lights Out Management.

withinboredom 1486 days ago

That’s why “usually” is in the sentence. :)

Most smaller teams usually don’t prioritize physical access — they usually only need it for one-off events. While this would be a one-off event, it would be one that affects many servers.

corobo 1486 days ago

I'd be more inclined to say that physical servers usually have some sort of console access available.

I'm not sure I've ever worked with any (2008-present) that don't in any case.

phillu 1486 days ago

That is really not my experience at all. Every professional smaller team I worked with "usually" had this figured out and set up. In times of home office, no one wants to be at the office for just pressing a single button on some server.

Oh well, I guess experiences differ.

withinboredom 1486 days ago

My ops experience is all pre-2012 and with teams numbering less than 3 for the whole org. So I’m sure things have changed or gotten cheaper? I can’t see a team of 3-4 having the budget to get something that allows them to be “lazy”, especially when that budget can go towards something useful. But I guess the pandemic probably changed things there?

jacquesm 1486 days ago

You can do this kind of thing across the network if you have to.

hansel_der 1486 days ago

no.

it requires access to the serial console or baseboard management controller or whatever terms have emerged.

have never rented a physical server w/o this.

sofixa 1486 days ago

> If you had `unattended-upgrades` running and had the "automatic reboot" option enabled, then all your Ubuntu 20.04 servers running Docker would reboot themselves and not come back up.

Isn't the common wisdom that you should have them enabled, but staggered across hours/days?

gtirloni 1486 days ago

Not a huge Debian/Ubuntu user but I think the systemd timer that triggers the unattended updates has a random delay added to it. I don't know if it's hours or just seconds.
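The delay can be read off the timer unit on any given host. A small sketch: the `randomized_delay` helper is invented here, and the sample input is illustrative rather than copied from a distro.

```shell
#!/bin/sh
# Print the RandomizedDelaySec value(s) found in systemd unit text on stdin.
randomized_delay() {
    sed -n 's/^RandomizedDelaySec=//p'
}

# On a Debian/Ubuntu host (not run here):
#   systemctl cat apt-daily-upgrade.timer | randomized_delay
# Illustrative input only; the values below are made up.
printf '[Timer]\nOnCalendar=daily\nRandomizedDelaySec=45m\nPersistent=true\n' | randomized_delay
```

With the made-up input above this prints `45m`; on a real host it prints whatever the installed timer unit and any drop-in overrides specify.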

markstos 1486 days ago

I believe it's staggered across hours by default, and it seems that Canonical might have been able to at least stop pushing out the bad update even before they had a fix.

AtlasBarfed 1486 days ago

Probably better you have rolling A/B replacements that stop the replacement run if the replacement doesn't come up.
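Something along these lines, as a sketch only: `replace_host` and `is_healthy` are hypothetical hooks standing in for whatever your tooling actually does, stubbed out here so the loop can be run anywhere.

```shell
#!/bin/sh
# Stub hooks (assumptions): swap in real provisioning and health checks.
replace_host() { echo "replacing $1"; }
is_healthy()   { [ "$1" != "web3" ]; }   # pretend web3 fails to come back

# Replace hosts one at a time and stop at the first one that stays down,
# leaving the remaining hosts on the old, known-good version.
rolling_replace() {
    for host in "$@"; do
        replace_host "$host"
        if ! is_healthy "$host"; then
            echo "halting: $host did not come up" >&2
            return 1
        fi
    done
}

rolling_replace web1 web2 web3 web4 || echo "rollout stopped early"
```

In this run web4 is never touched, which is the point: one bad replacement costs one host instead of the whole fleet.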

This is mostly an in-place upgrade issue?

ConstantVigil 1486 days ago

And while I haven't had this happen to me yet; the fear of something like this or even worse is why I try to stay one step behind the update paths on linux distros.

Security patches matter, but I'm no one important, so I should be fine to wait a week or month...

Anyone else who is important though... servers for example...

rawoke083600 1486 days ago

That sounds downright horrible!