by jon-wood 1195 days ago
Assuming they’ve still got access to the servers themselves via SSH, you’d start by issuing a new root CA cert for the Puppetmaster and putting that in place, then you’d have to issue a new cert for every client and distribute those. It’s not impossible, but it’s also going to be a pain in the backside to do.

If you read through the guide [1] it requires you to have sudo access to bounce the puppet process on the client nodes.

This is because the whole idea is that you have inaccessible, locked-down Production servers that only Puppet (which is driven from a central, governed configuration management source) has authority to configure, i.e. no SSH and no root access.

That leaves only one option: physically visiting each server at the datacenter and issuing the commands.

[1] https://www.puppet.com/docs/puppet/5.5/ssl_regenerate_certif...

Been there before, we did exactly this; except over OOB+reboot-into-single-user (because SELinux). Took us a few days (~5k servers) but managed to get out of it with no public-facing downtime. The other way would have just been to rekick the world one box at a time. A number of integration tests were added after that disaster :)
According to this [1] Twitter has 500,000+ servers spread across DCs, GCP and AWS.

If we assume only a team of your size remains, then at your pace (~5k servers in a few days) it would take 300+ days.

That would mean no OS patches etc which would put them firmly in the crosshairs of the FTC.

[1] https://twitter.com/d_feldman/status/1562265193249390593
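
As a back-of-the-envelope check on the 300+ day figure, here is a minimal sketch that scales the "~5k servers in a few days" rate linearly to 500,000 servers. The function name is made up for illustration, and reading "a few days" as 3 days is an assumption; the thread does not give an exact number.

```python
def days_to_recover(total_servers: int, servers_done: int, days_taken: float) -> float:
    """Linearly extrapolate recovery time from an observed recovery rate."""
    if servers_done <= 0 or days_taken <= 0:
        raise ValueError("servers_done and days_taken must be positive")
    rate_per_day = servers_done / days_taken
    return total_servers / rate_per_day


# Assumption: "a few days" is taken as 3 days for ~5k servers.
estimate = days_to_recover(total_servers=500_000, servers_done=5_000, days_taken=3)
print(f"about {estimate:.0f} days")
```

This only holds if the work is done serially at the same rate, which is exactly the assumption the estimate rests on.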

Interesting. Considering the number of MAU to be around 350 million, that's about 700 users per server. Of course it's not that simple, because not all servers are the same and more importantly not all users are the same, but that sounds a bit on the low end.

Anecdotal data point: infosec.exchange hosts 30k users on 7 servers (https://infosec.exchange/@jerry/109374478717918484). That's roughly a 1:5 ratio compared with Twitter. Again, not the same usage and performance requirements, but I find it interesting.
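
For anyone who wants to redo the arithmetic, here is a small sketch using only the figures quoted in this thread (350 million MAU, 500,000 servers, 30k users on 7 servers). The helper name is hypothetical. With the unrounded numbers the ratio comes out nearer 1:6, which is in the same ballpark as the 1:5 quoted.

```python
def users_per_server(users: int, servers: int) -> float:
    """Average number of users served by each server."""
    return users / servers


twitter = users_per_server(350_000_000, 500_000)  # figures quoted in the thread
infosec = users_per_server(30_000, 7)             # figures from the linked post

print(f"Twitter: {twitter:.0f} users/server")
print(f"infosec.exchange: {infosec:.0f} users/server")
print(f"ratio: 1:{infosec / twitter:.1f}")
```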

If you are split-cloud under a homogeneous puppet master without homogeneous break-glass SSH access (which would be crazy) then probably your best bet is to just re-kick the world. But the scaling factor for this sort of thing is most certainly not team size; it's "how many X servers can be down at the same time", which will increase with your number of servers. In any case I think the FTC is the least of Twitter's concerns right now.
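
To illustrate why concurrency rather than team size is the scaling factor, here is a minimal sketch that re-kicks a fleet in waves. Every number in it is hypothetical (5% of the fleet down at once, one hour per re-kick), and it assumes the re-kick itself is automated so a whole wave can run in parallel.

```python
import math


def rekick_hours(total_servers: int, max_down_percent: int, hours_per_rekick: float) -> float:
    """Wall-clock hours to re-kick a fleet in waves, where each wave takes
    down at most max_down_percent of the fleet at the same time."""
    if not 0 < max_down_percent <= 100:
        raise ValueError("max_down_percent must be in 1..100")
    # Integer arithmetic avoids floating-point rounding in the batch size.
    batch_size = max(1, total_servers * max_down_percent // 100)
    waves = math.ceil(total_servers / batch_size)
    return waves * hours_per_rekick


# Hypothetical inputs: 5% of the fleet down at once, 1 hour per re-kick.
for fleet in (5_000, 500_000):
    print(fleet, rekick_hours(fleet, max_down_percent=5, hours_per_rekick=1.0))
```

Under these assumptions both fleets finish in the same number of waves, because the allowed batch size grows with the fleet.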
Not sure if it's still the case, but last time I had co-located servers you could access the systems via OOB without needing to reboot them into single-user mode.

If it's not the case, then Twitter is definitely in far more trouble because, according to past engineers, at least a few of their services needed manual intervention on a full-scale reboot. And losing quorum in a distributed system is never pretty.