by 8organicbits 1197 days ago
I'll take the rumor with a grain of salt, but can anyone unpack what the recovery plan would be for something like this? It would obviously be a big problem, but where would you even start?
4 comments

Assuming they’ve still got access to the servers themselves via SSH, you’d start by issuing a new root CA cert for the Puppetmaster and putting that in place, then you’ve got to issue a new cert for every client and distribute those. It’s not impossible, but it’s also going to be a pain in the backside to do.
If you read through the guide [1] it requires you to have sudo access to bounce the puppet process on the client nodes.

This is because the whole idea is that you have inaccessible, locked-down production servers that only Puppet (which is driven from a central, governed configuration management source) has authority to configure, i.e. no SSH and no root access.

That leaves only one option: physically visiting each server at the datacenter and issuing the commands.

[1] https://www.puppet.com/docs/puppet/5.5/ssl_regenerate_certif...

Been there before, we did exactly this; except over OOB+reboot-into-single-user (because SELinux). Took us a few days (~5k servers) but managed to get out of it with no public-facing downtime. The other way would have just been to rekick the world one box at a time. A number of integration tests were added after that disaster :)
According to this [1] Twitter has 500,000+ servers spread across DCs, GCP and AWS.

If we assume only a team of your size remains and works at the same pace (~5k servers in a few days), then 500,000 servers is 100 times the work, so it would take 300+ days.

That would mean no OS patches, etc., which would put them firmly in the crosshairs of the FTC.

[1] https://twitter.com/d_feldman/status/1562265193249390593

Interesting. Considering the number of MAU to be around 350 million, that's a bit fewer than 1000 persons per server. Of course it's not that simple, because not all servers are the same and more importantly not all users are the same, but it sounds like a bit on the low end.

Anecdotal data point: infosec.exchange hosts 30k users on 7 servers (https://infosec.exchange/@jerry/109374478717918484). That's roughly 4,300 users per server, about a 1:5 ratio compared with Twitter. Again, not the same usage and performance requirements, but I find it interesting.
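
For anyone who wants to check the arithmetic, here is a quick sketch. The inputs are just the figures quoted in this thread (350 million MAU, 500,000 servers, 30k users on 7 servers), not measurements, and treating every server as identical is an assumption.

```python
# Back-of-envelope users-per-server comparison, using figures quoted in the thread.
def users_per_server(users: int, servers: int) -> float:
    return users / servers

twitter = users_per_server(350_000_000, 500_000)
infosec = users_per_server(30_000, 7)

print(f"Twitter:          {twitter:,.0f} users/server")
print(f"infosec.exchange: {infosec:,.0f} users/server")
print(f"ratio:            1:{infosec / twitter:.1f}")
```

With exactly 500,000 servers the ratio comes out nearer 1:6 than 1:5; the 1:5 figure matches the rounder "a bit fewer than 1000 per server" estimate.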

If you are split-cloud under a homogeneous puppet master without homogeneous break-glass SSH access (which would be crazy) then probably your best bet is to just re-kick the world. But the scaling factor for this sort of thing is most certainly not team size; it's "how many X servers can be down at the same time", which will increase with your number of servers. In any case I think the FTC is the least of Twitter's concerns right now.
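
A rough sketch of that scaling argument, under a deliberately simplified model: identical servers, and a fixed percentage of the fleet allowed to be down at once. The function name and the 2% figure are made up for illustration.

```python
import math

def rekick_waves(total_servers: int, max_down_percent: int) -> int:
    """Sequential waves needed if at most max_down_percent of the fleet
    may be offline at once (hypothetical model: all servers identical)."""
    if not 0 < max_down_percent <= 100:
        raise ValueError("max_down_percent must be in 1..100")
    # Integer arithmetic avoids float rounding; always re-kick at least one box.
    batch = max(1, total_servers * max_down_percent // 100)
    return math.ceil(total_servers / batch)

for fleet in (5_000, 500_000):
    print(fleet, "servers ->", rekick_waves(fleet, 2), "waves")
```

Under this model both fleets need the same number of waves, because the batch size grows with the fleet. Wall-clock time then depends on how long one wave takes, not on how many boxes there are.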
Not sure if it's still the case but last time I had co-located servers you could access the systems via OOB without needing to reboot them in single user mode.

If it's not the case then Twitter is definitely in far more trouble because, according to past engineers, at least a few of their services needed manual intervention on a full-scale reboot. And losing quorum in a distributed system is never pretty.

It's nearly impossible to predict recovery without understanding the system. You would probably need to know how ssh is configured, how secrets are managed, and how files are distributed, both before and after puppet.

Circular dependencies can absolutely wreck you. For example, Puppet could configure sudoers, and without the Puppet config being applied, people who would normally expect access might not have it. So now you have to find a privileged SSH key for un-configured machines.
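
To make the circular-dependency worry concrete, here is a minimal sketch that looks for a cycle in a recovery dependency graph. The node names (`fix_puppet_cert`, `root_shell`, `break_glass_key`, and so on) are hypothetical and only mirror the sudoers example above; a real inventory would be far messier.

```python
from typing import Dict, List, Optional

def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:      # back edge: found a cycle
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# "X": ["Y"] means "X cannot happen until Y works" (hypothetical example).
deps = {
    "fix_puppet_cert": ["root_shell"],
    "root_shell": ["sudoers"],
    "sudoers": ["puppet_run"],
    "puppet_run": ["fix_puppet_cert"],
}
print(find_cycle(deps))

# A key that does not depend on Puppet breaks the loop.
with_break_glass = dict(deps, root_shell=["break_glass_key"])
print(find_cycle(with_break_glass))
```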

I would be surprised if Twitter did not have a physical vault with a USB drive with a root SSH key on it. With that you can do just about everything.

I would be most terrified of machine churn. Auto-remediation systems or elastic capacity systems can result in lost capacity that can't come back until the configuration problem is resolved.

Create a new root CA, SSH to the machine, remove the old certs, re-add the machine to Puppet, sign the new CSR on the Puppet master, and then the agent will download the new root.

Very simple operation... if you have working SSH access with root. If they don't, well...
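
Roughly, the per-machine part could be scripted like the sketch below. It only builds the command strings and runs nothing. Several things here are assumptions: root SSH to both hosts works, `puppet` is on root's PATH, the agent service is named `puppet`, and the master is on Puppet 5.x where `puppet cert sign` exists (Puppet 6+ moved signing to `puppetserver ca`). The new root CA is assumed to already be in place on the master, and the hostnames are placeholders.

```python
import shlex
from typing import List

def ssh(host: str, remote_cmd: str) -> str:
    """Format one ssh invocation; shlex.join quotes the remote command safely."""
    return shlex.join(["ssh", f"root@{host}", remote_cmd])

def rotation_plan(agent_host: str, certname: str, master_host: str) -> List[str]:
    """Build (but do not run) the command sequence for one agent."""
    return [
        ssh(agent_host, "systemctl stop puppet"),
        # Remove the old keys and certs; ssldir is looked up rather than hard-coded.
        ssh(agent_host, 'rm -rf "$(puppet config print ssldir)"'),
        # First run generates a new key and submits a CSR to the master.
        ssh(agent_host, "puppet agent --test"),
        # Sign the pending CSR on the master (Puppet 5.x syntax, an assumption).
        ssh(master_host, f"puppet cert sign {shlex.quote(certname)}"),
        # Second run picks up the signed cert and the new CA.
        ssh(agent_host, "puppet agent --test"),
        ssh(agent_host, "systemctl start puppet"),
    ]

for line in rotation_plan("web01.example.com", "web01.example.com", "puppet.example.com"):
    print(line)
```

Printing the plan first makes it easy to review before pointing anything at a real fleet.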

If you don't have ssh access with root, hopefully you have access to something like the underlying hypervisor, to do the equivalent of "sudo xl console vmname" on a xen dom0 to get what is logically the same as a physical serial tty (or local vga+keyboard) console on the domU machine.

Or the VMware esxi emulated graphical console, etc.

Or if it's a bunch of bare metal machines, hopefully someone old-school in the organization thought to deploy 48/96-port rs232 console serial concentrators and wire them up to the db9 serial port on each physical server. And you didn't disable all local serial tty in your operating config.

To my knowledge all modern DCs have out-of-band networks for this sort of thing that provide serial access to the BMC chip, nothing old school about that. Old school is having to submit a ticket to Jerry in the DC to walk the crash cart down to box 55AE, hook up a serial console, run diagnostics, and attach the output back to the ticket. You only have to deal with Jerry occasionally now, usually when the BMC or power rails fail.
There are more than a few people who've decided the security risk of a full console-capable BMC is not acceptable and, if other failover systems are engineered appropriately, not necessary at all. So BMC/IPMI is intentionally disabled or not connected to any network.

Anecdotally I have seen a number of low cost x86-64 pseudo blade setups similar to open compute platform design stuff which have no oob. If a unit fails it's pulled entirely and put in a work queue for someone to repair.

In both cases it's a disruptive event, as you have to reboot the machine to get into rescue mode (where you don't need the password).
> Or if it's a bunch of bare metal machines, hopefully someone old-school in the organization thought to deploy 48/96-port rs232 console serial concentrators and wire them up to the db9 serial port on each physical server. And you didn't disable all local serial tty in your operating config.

In a hacker folklore story this would 100% be the solution. And for some reason they'd have to use an original VT100 that some greybeard had lovingly restored at home.

If they're in the cloud, it's pretty straightforward to re-mount the drive somewhere else and replace the SSH keys.
And if they have the same template everywhere, it's probably not even too hard to script.
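
A minimal sketch of the "replace the SSH keys" step, assuming the instance's root volume has already been detached and mounted somewhere on a rescue machine. The cloud-specific detach/attach calls are left out on purpose, the function name is made up, and the key string is a placeholder rather than a real key.

```python
import tempfile
from pathlib import Path

def replace_authorized_keys(mount_root: Path, home: str, public_key: str) -> Path:
    """Overwrite authorized_keys for one account on a volume mounted at mount_root.

    Ownership is not changed here; on a real volume the files must also be
    owned by the target user, or sshd's default StrictModes check may reject them.
    """
    ssh_dir = mount_root / home.lstrip("/") / ".ssh"
    ssh_dir.mkdir(parents=True, exist_ok=True)
    ssh_dir.chmod(0o700)
    key_file = ssh_dir / "authorized_keys"
    key_file.write_text(public_key.strip() + "\n")
    key_file.chmod(0o600)
    return key_file

# Demo against a temporary directory standing in for the mounted volume.
with tempfile.TemporaryDirectory() as tmp:
    written = replace_authorized_keys(
        Path(tmp), "/root", "ssh-ed25519 PLACEHOLDER-KEY recovery@example"
    )
    print(written.relative_to(tmp))
```

If every image really does share one template, the same function can be looped over each mounted volume.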
Depends how disposable the individual servers are. I don't know specifics of the Twitter infra, but I would probably just issue a new cert and begin shooting and replacing the old servers. Hopefully the services are abstracted from the puppet cert and things like Redis and whatnot will safely reprovision and find their quorums.