Hacker News
by jwestbury 901 days ago
One of the principal engineers I used to work with at AWS had a saying: "A one-year certificate expiration is an outage you schedule a year in advance." Of course, it's a bit hyperbolic -- but a ten-year expiration is almost certain to result in an outage.

In a similar vein, you should never generate resources which will expire unless some undocumented action is taken. A common one I've seen is self-signed certs which last for n days, and are re-generated whenever an application is deployed or restarted, under the assumption that the application will never run untouched longer than that. (Spoiler: It probably will, at some point, whether due to unexpected change freezes, going into maintenance mode, or -- in my personal favourite -- being deployed to an environment that just isn't updated as regularly.)
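One way to avoid depending on a redeploy is to have the running process check its own certificate against a renewal window. What follows is only a minimal sketch, not anyone's actual implementation: the function names, the 30-day window, and the dates are assumptions made up for illustration. The only library call it relies on is `ssl.cert_time_to_seconds` from the Python standard library, which parses the `notAfter` string format.

```python
import ssl
from datetime import datetime, timedelta, timezone


def expiry_from_not_after(not_after: str) -> datetime:
    """Parse a certificate 'notAfter' string such as 'Jan  5 09:34:43 2018 GMT'."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def needs_renewal(not_after: str, now: datetime, renew_before: timedelta) -> bool:
    """True once the certificate is inside the renewal window (or already expired)."""
    return expiry_from_not_after(not_after) - now <= renew_before


# Hypothetical values for illustration only.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
window = timedelta(days=30)
print(needs_renewal("Jan 20 00:00:00 2024 GMT", now, window))  # expires inside the window
print(needs_renewal("Jun  1 00:00:00 2024 GMT", now, window))  # expires well outside it
```

Run on a timer inside the application, a check like this triggers regeneration (or at least an alert) based on the clock rather than on the assumption that someone will deploy in time.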

4 comments

That Principal Engineer's knowledge came from painful repeated experiences in AWS. When I left AWS in 2016 they were trying to push towards 3 monthly cert rotations, and hoping to get it shorter.

A year-long expiry isn't frequent enough that you build automation, and is long enough that the runbook you have is likely out of date before the next time you execute it. If you make it 3-monthly, it's more likely to be fully or mostly automated, and it's more likely you'll remember that certs were recently introduced in a particular service. If you make it monthly, it's pretty much guaranteed that it'll be fully automated.

Almost every week in the AWS-wide ops meetings, one service or another would be talking about something that went wrong because a certificate expired, either in a place they'd forgotten they had certificates or in one they'd missed when they did the rotation. A number of those failures presented in particularly misleading ways, too, by nature of what role the cert was playing.

Does one actually manage to avoid such outages for 10 years by making the problem recur every month? 'cause I feel like stuff would still break even if you test and run them regularly.
You might hit an outage, but you'll hit it within a month of deploying the new code that caused it, so you'll have the context and staffing expertise to fix it so it doesn't happen next month. Whereas if the outage happens in ten years, you'll need some software archaeologists to find the root cause and likely won't have the expertise available to fix it.

And maybe you say "it's one outage either way, but isn't it better in ten years than next month?" But when you're constantly adding new services, eventually there will come a time where every month some new service is having its ten year anniversary.

Sounds like they need a system that actually gets remembered and referenced if they want to stick to 1 year expiries.
One day I could not connect to my (home) server. Turns out the client certificate had expired: I never thought to make a note of, or increase, the 10-year default value when I did my test configuration...
I remember there being a weird clock rollover bug that only financial firms would hit (since they never took their machines down, ever)

That was a long time ago. I wonder if technology/the cloud has changed that, or if they still run those same machines.

30 years ago companies were rebooting their mainframes twice a year just to make sure. Before doing that, companies were burned because the mainframe went down accidentally (backup generator broke during a power outage) and they couldn't get it to start because someone had changed a setting at runtime but didn't save the setting to the boot scripts - then that person retired or found a new job. By rebooting twice a year they were able to ensure that someone remembered what setting was changed when the system failed to start.
Chaos Engineering!

Untested emergency plans are not a guarantee that the plans will work.

One of the things that I loved about ISO9001: sure, it made every sysadmin action something that made police paperwork look 'light', but it ensured you didn't hit this kind of thing, or if you did, it meant an instant gross negligence dismissal for whoever stopped documenting or following the documented procedure.
Financial firms will also hit time-based bugs before most organizations because they often deal with forecasting events 30+ years in the future (e.g. mortgages). For a bank, the 2038 rollover has been relevant since 2008.
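The arithmetic behind that can be sketched quickly. A signed 32-bit count of seconds since the Unix epoch tops out at 2^31 - 1, which lands in January 2038, so a 30-year term starting in 2008 can already end past it. The loan dates and the helper name below are hypothetical, chosen only to illustrate the point.

```python
from datetime import datetime, timezone

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit time_t can hold

# Last instant representable as signed 32-bit seconds since the Unix epoch.
ROLLOVER = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)


def fits_in_int32_time(moment: datetime) -> bool:
    """True if `moment` can be stored as a signed 32-bit Unix timestamp."""
    return INT32_MIN <= int(moment.timestamp()) <= INT32_MAX


# Hypothetical 30-year loan: originated mid-2008, maturing mid-2038.
origination = datetime(2008, 6, 1, tzinfo=timezone.utc)
maturity = origination.replace(year=origination.year + 30)

print(ROLLOVER.isoformat())             # 2038-01-19T03:14:07+00:00
print(fits_in_int32_time(origination))  # the start date is representable
print(fits_in_int32_time(maturity))     # the maturity date is not
```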
I hit one of these on an EMC VNX array one time; after ~400 days all the controllers crashed at the same time. Didn't help that it happened at 4am on New Year's Day. I do recall other instances of this class of bug, but nothing specific.
I had to do a release to fix an outage because someone set up a system that would have an outage every six months if no one ran a release.

Naturally, they didn't document this.