| HN Mirror


	by jon-wood 1195 days ago
	Assuming they’ve still got access to the servers themselves via SSH, you’d start by issuing a new root CA cert for the Puppetmaster and putting that in place, then you’ve got to issue a new cert for every client and distribute those. It’s not impossible, but it’s also going to be a pain in the backside to do.

2 comments

threeseed 1195 days ago

If you read through the guide [1] it requires you to have sudo access to bounce the puppet process on the client nodes.

This is because the whole idea is that you have inaccessible, locked down Production servers that only Puppet (which is driven from a central, governed configuration management source) has authority to configure i.e. no SSH and no root access.

That leaves only one option: physically visiting each server at the datacenter and issuing the commands.

[1] https://www.puppet.com/docs/puppet/5.5/ssl_regenerate_certif...

justsomeadvice0 1195 days ago

Been there before, we did exactly this; except over OOB+reboot-into-single-user (because SELinux). Took us a few days (~5k servers) but managed to get out of it with no public-facing downtime. The other way would have just been to rekick the world one box at a time. A number of integration tests were added after that disaster :)

threeseed 1195 days ago

According to this [1] Twitter has 500,000+ servers spread across DCs, GCP and AWS.

If we assume only a team of your size remains, then at your rate (~5k servers in a few days) it would take 300+ days.

That would mean no OS patches etc., which would put them firmly in the crosshairs of the FTC.

[1] https://twitter.com/d_feldman/status/1562265193249390593
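
For reference, the 300+ day figure follows from a simple linear extrapolation of the numbers quoted in this thread. The sketch below is only illustrative: the function name is made up, and reading "a few days" as 3 days is an assumption.

```python
def extrapolate_days(known_servers, known_days, target_servers):
    """Linearly scale recovery time, assuming the same team and per-server effort."""
    return known_days * target_servers / known_servers

# ~5k servers took "a few days"; 3 days is an assumed reading of "a few".
days = extrapolate_days(known_servers=5_000, known_days=3, target_servers=500_000)
print(days)  # 300.0
```

A linear model like this ignores parallelism and automation, so it is an upper-bound style guess rather than a forecast.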

rakoo 1195 days ago

Interesting. Considering the number of MAU to be around 350 million, that's a bit fewer than 1000 persons per server. Of course it's not that simple, because not all servers are the same and more importantly not all users are the same, but it sounds like a bit on the low end.

Anecdotal data point: infosec.exchange hosts 30k users on 7 servers (https://infosec.exchange/@jerry/109374478717918484). That's about 4,300 users per server, so roughly a 1:5 ratio against the Twitter estimate. Again, not the same usage and performance requirements, but I find it interesting.
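
A quick way to check the per-server figures above, using only the numbers quoted in this thread (the helper name is hypothetical). With exactly 350 million MAU and 500,000 servers, the ratio comes out closer to 1:6 than 1:5.

```python
def users_per_server(users, servers):
    """Average number of users served by each machine."""
    return users / servers

twitter = users_per_server(350_000_000, 500_000)  # figures quoted in the thread
infosec = users_per_server(30_000, 7)

print(round(twitter), round(infosec), round(infosec / twitter, 1))  # 700 4286 6.1
```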

justsomeadvice0 1195 days ago

If you are split-cloud under a homogeneous puppet master without homogeneous break-glass SSH access (which would be crazy) then probably your best bet is to just re-kick the world. But the scaling factor for this sort of thing is most certainly not team size; it's "how many X servers can be down at the same time", which will increase with your number of servers. In any case I think the FTC is the least of Twitter's concerns right now.
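
To illustrate the scaling point, here is a rough sketch that counts sequential re-kick waves when only a fixed share of the fleet may be down at once. The function name and the 5% limit are assumptions chosen for the example, not figures from anyone's real fleet.

```python
def rekick_waves(total_servers, max_down_percent):
    """Sequential waves needed if at most max_down_percent of servers are down at once."""
    batch = max(1, total_servers * max_down_percent // 100)  # servers per wave
    return -(-total_servers // batch)  # ceiling division

# Hypothetical 5% concurrent-downtime budget for both fleet sizes.
for fleet in (5_000, 500_000):
    print(fleet, rekick_waves(fleet, 5))
# 5000 20
# 500000 20
```

Under this assumption the wave count stays the same as the fleet grows, because the batch size grows with it.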

threeseed 1195 days ago

Not sure if it's still the case, but last time I had co-located servers you could access the systems via OOB without needing to reboot them in single-user mode.

If it's not the case then Twitter is definitely in far more trouble because, according to past engineers, at least a few of their services needed manual intervention on a full-scale reboot. And losing quorum in a distributed system is never pretty.
