| HN Mirror


by tialaramex 1927 days ago

> If you want to wake me up a Saturday at 4 am, the world better be on fire.

In a previous role I actually did have things set up so that I might be woken at 4am on a Saturday. IIRC, under certain conditions it would play "Straight Outta Compton" at full volume on my hi-fi to ensure my attention, which gave me about 10 seconds before it got real loud.

I was leak hunting, specifically looking for a huge leak in a production system that we couldn't reproduce on smaller test systems - so I needed to wake up, attempt to diagnose the leak and then (regardless of whether successful or not) mitigate it (kill the bloated process, one transaction fails but everything else will auto-repair) and go back to bed.

But what I don't understand here is, why are you constantly rebuilding and alerting on failure? A CI flag can wait until Monday stand-up. Are you auto-deploying any change of state even when there aren't any humans around to cause that? Why? That strikes me as up there with Apache's "But your Good OCSP response expires in 18 hours, so I stapled this newer Bad one instead" in terms of terrible mistakes.

If instead your new builds fail after pruning, the human who is causing a new build can decide what to do about that when it happens, no need for anybody to be woken at 4am.

1 comment

viraptor 1927 days ago

Your Saturday 4am may be someone else's Friday 3pm. Humans may be around and doing their normal work.


unionpivo 1926 days ago

Which is why you don't[1] push to production on Friday at 3pm.

[1] As always, there are exceptions to this rule.


viraptor 1926 days ago

That's not a great rule in this case. Unless you also don't push on Monday 1pm because it could be someone's Monday 1am?


unionpivo 1925 days ago

It looks like we are talking about different things.

So you have a team (I suppose it could be just 1 person) or teams across a few time zones.

Each team should deploy in a way that ensures they have people ready to fix things if they go bad.

You can do that by having a large enough team in similar time zones (+/- 2-3 hr), or by paying extra to have people on standby at 1am.

My personal opinion is that having a large enough team (again, could just be 1 guy) in a similar time zone is preferable.

Bottom line is, deploying new stuff is always risky. That's why people spend so much time trying to reduce this risk (various tests, CI, staged rollouts ...). And sometimes all of that still fails, and you need people to either roll back or fix it on the spot.
