Hacker News
by Twirrim 900 days ago
That Principal Engineer's knowledge came from painful, repeated experience at AWS. When I left AWS in 2016 they were trying to push towards rotating certs every 3 months, and hoping to get that even shorter.

A year-long expiry isn't frequent enough to make you build automation, and it's long enough that the runbook you have is likely out of date by the next time you execute it. If you rotate every 3 months, it's more likely to be fully or mostly automated, and it's more likely you'll remember that certs were recently introduced in a particular service. If you rotate monthly, it's pretty much guaranteed that it'll be fully automated.

Almost every week in the AWS-wide ops meetings, one service or another would be talking about something that went wrong because a certificate expired, either in a place they'd forgotten they had certificates or one they'd missed during the rotation. A number of those failures presented in particularly misleading ways, too, because of the role the cert was playing.
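
To make the "forgotten certificate" failure mode concrete, here is a minimal sketch of the kind of check that catches it, assuming you already keep an inventory mapping each service to its cert's expiry string. The function name, the service names, and the 30-day window are all hypothetical; the only real API used is `ssl.cert_time_to_seconds`, which parses the `notAfter` format returned by `getpeercert()`.

```python
import ssl
from datetime import datetime, timedelta, timezone


def expiring_soon(inventory, now, window_days=30):
    """Return (name, days_left) for certs expiring within window_days, soonest first.

    `inventory` maps a service name to a notAfter string such as
    "Jan 15 00:00:00 2024 GMT". Already-expired certs get a negative days_left.
    """
    cutoff = now + timedelta(days=window_days)
    flagged = []
    for name, not_after in inventory.items():
        # Convert the notAfter string to an aware UTC datetime.
        expires = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(not_after), tz=timezone.utc
        )
        if expires <= cutoff:
            flagged.append((name, (expires - now).days))
    return sorted(flagged, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical inventory; in practice this is the part people forget to maintain.
    inventory = {
        "api-frontend": "Jan 15 00:00:00 2024 GMT",
        "internal-queue": "Dec 25 00:00:00 2023 GMT",
        "billing": "Jun 10 00:00:00 2024 GMT",
    }
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for name, days_left in expiring_soon(inventory, now):
        print(f"{name}: {days_left} days left")
```

The check is only as good as the inventory behind it, which is the point above: a cert nobody remembered to list never shows up here either.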

2 comments

Does one actually manage to avoid such outages for 10 years by making the problem recur every month? 'cause I feel like stuff would still break even if you test and run the rotations regularly.
You might hit an outage, but you'll hit it within a month of deploying the new code that caused it, so you'll have the context and staffing expertise to fix it so it doesn't happen next month. Whereas if the outage happens in ten years, you'll need some software archaeologists to find the root cause and likely won't have the expertise available to fix it.

And maybe you say "it's one outage either way, but isn't it better in ten years than next month?" But when you're constantly adding new services, eventually there will come a time when every month some new service is having its ten-year anniversary.
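
A rough sketch of that arithmetic, assuming (purely for illustration) one new service launched per month for twenty years, each with a cert that expires 120 months after launch. The function name and the launch schedule are made up.

```python
from collections import Counter


def expiries_per_month(launch_months, lifetime_months):
    """Count how many certs expire in each month, given each service's launch month."""
    return Counter(launch + lifetime_months for launch in launch_months)


if __name__ == "__main__":
    # Hypothetical schedule: one launch per month, months 0 through 239.
    counts = expiries_per_month(range(240), lifetime_months=120)
    quiet = [m for m in range(120) if counts[m] > 0]
    busy = [m for m in range(120, 360) if counts[m] > 0]
    print(f"months with an expiry in the first ten years: {len(quiet)}")
    print(f"months with an expiry in the following twenty: {len(busy)}")
```

Under those assumptions nothing expires for the first ten years, and after that some service hits its expiry every single month, each one as unfamiliar to the current team as the first.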

Sounds like they need a system that actually gets remembered and referenced if they want to stick to 1 year expiries.