Hacker News
by bertil 916 days ago
I think that’s a great policy as it’s clearly intended to help people when they need it, and get people to unplug when it’s valued by their loved ones.

_However_ (that part is probably best bookmarked until Jan 2nd), it also betrays that your system is brittle and can be broken by a bad commit. Don’t do it because you want people to grind until Dec 24th at 6 pm. Do it because it’s great the rest of the year, too. I’d recommend you look into (or ask me about) feature flags, alerting, and automated roll-backs.

The short version is: there’s a meta-system on top of your release process that can tell (if you are using roll-backs, not feature flags):
- commits until xyzsdf are fine;
- roll-outs starting from commit abcdef have a 2% error rate, 80% on Android;
- revert to xyzsdf, and send a message (low-priority, email) to the DevOps on call and the author of abcdef that it happened;
- for all commits after abcdef: if there are no conflicts with xyzsdf, re-try to roll them out;
- if there is a conflict because they were on top of abcdef, send a message (low-priority email) to the authors that there is a conflict.
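The steps above can be sketched roughly as follows. This is a minimal illustration, not any particular tool: the `Commit` type, the 1% threshold, the `devops-on-call` name, and the idea of treating "touches the same files as a dropped commit" as a conflict are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    author: str
    files: frozenset  # files touched; a hypothetical stand-in for real conflict detection


def first_bad_commit(commits, error_rates, threshold=0.01):
    """Return the oldest commit whose roll-out error rate exceeds the threshold, or None."""
    for commit in commits:  # commits are ordered oldest first
        if error_rates.get(commit.sha, 0.0) > threshold:
            return commit
    return None


def plan_recovery(commits, bad, on_call="devops-on-call"):
    """Decide what to revert to, what to re-try, and whom to email."""
    idx = commits.index(bad)
    if idx == 0:
        raise ValueError("no known-good commit to revert to")
    plan = {
        "revert_to": commits[idx - 1].sha,
        "notify_revert": [on_call, bad.author],  # low-priority email
        "retry": [],
        "notify_conflict": [],  # low-priority email
    }
    dropped_files = set(bad.files)
    for commit in commits[idx + 1:]:
        if dropped_files & commit.files:
            # Built on top of a dropped commit: cannot be replayed cleanly.
            plan["notify_conflict"].append(commit.author)
            dropped_files |= commit.files
        else:
            plan["retry"].append(commit.sha)
    return plan
```

A real system would ask the version control system whether a commit applies cleanly instead of comparing file sets, but the shape of the decision is the same.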

There are more sophisticated versions that can do things like, if you use feature flags, flagging Android users to use the previous version. Another way to do this is to scale who has access to abcdef gradually: say 1% every hour, and revert if you detect issues.
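A gradual roll-out like that could look something like the sketch below. The hashing scheme, function names, and the 1% error threshold are assumptions for illustration only; the one detail taken from the comment is the "1% every hour" schedule.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99, so exposure is sticky per user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def rollout_percent(hours_since_start: float, step_percent: int = 1, reverted: bool = False) -> int:
    """Start at step_percent and add step_percent per full hour, capped at 100."""
    if reverted or hours_since_start < 0:
        return 0
    return min(100, (int(hours_since_start) + 1) * step_percent)


def sees_new_version(user_id: str, hours_since_start: float, reverted: bool = False) -> bool:
    """True if this user should get the new commit at this point in the roll-out."""
    return bucket(user_id) < rollout_percent(hours_since_start, reverted=reverted)


def should_revert(errors: int, requests: int, threshold: float = 0.01) -> bool:
    """Revert when the observed error rate of the new version exceeds the threshold."""
    return requests > 0 and errors / requests > threshold
```

Because the bucket is derived from the user id, a user who gets the new version at hour 3 keeps it at hour 4, and a revert simply drops everyone back to 0%.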

All those seem daunting to teams that haven’t worked like this before, but in my experience, they come to love it very quickly.

3 comments

We use these systems liberally at other times of the year and no one notices, usually. If they do, downtime and interruption budgets handle this.

/However/, let me counter with this point: just one of our customers has 8000 FTEs working with our system. During hell-time (aka, December and Christmas shopping and shipping), each of those dudes spends their shift taking customer calls lasting 2-4 minutes, which in turn require a few requests into our systems.

Due to the stress of their customers^2 (because it's Christmas and holidays and such), if an agent of a customer is unable to access our systems, they cannot handle the use case of the customer^2 and that will piss off the customer of the customer.

So if we push a bad change during this time, we're going to piss off hundreds of customers^2 per minute for that one customer alone. Even with a fast automatic rollback, that's a long time during hell-time. And they have people who know how to yell at vendors in nasty ways who don't like that.
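A back-of-the-envelope sketch of that load, using the 8000 FTEs and 2-4 minute calls from above. The share of agents on shift at once (25%) and the 5-minute rollback are made-up assumptions for illustration, and idle time between calls is ignored.

```python
FTES = 8000              # from the comment
ON_SHIFT_SHARE = 0.25    # assumption: fraction of agents on calls at the same time
ROLLBACK_MINUTES = 5     # assumption: time until an automatic rollback completes


def calls_per_minute(agents_on_shift: float, call_minutes: float) -> float:
    """Calls finished per minute if every agent is on back-to-back calls."""
    return agents_on_shift / call_minutes


agents = FTES * ON_SHIFT_SHARE
low = calls_per_minute(agents, 4)    # long calls: fewer calls per minute
high = calls_per_minute(agents, 2)   # short calls: more calls per minute

print(f"{low:.0f}-{high:.0f} calls per minute")
print(f"{low * ROLLBACK_MINUTES:.0f}-{high * ROLLBACK_MINUTES:.0f} calls hit during the rollback window")
```

Under those assumptions that is 500-1000 calls per minute, so even a rollback that finishes in a few minutes touches thousands of calls.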

I enjoy moving software fast and enabling moving software quickly, but customer focus and customer orientation mean understanding when to move slowly as well.

And hey, if that means more quiet holidays for the hard working operators on my team, who's gonna complain?

You are a lot more ahead than most companies.

I’ve worked for too many places where the Christmas break was because of a lack of tooling. I’m glad you are two steps ahead.

As the person before mentioned, partial rollouts with separate monitoring would help with that and might be an improvement the other 11 months.

But we are doing the same thing: for 2 weeks around Christmas there is a "please take holidays if you can" period where we do not merge anything except priority-one tickets, and no such ticket has come up yet.

How do you detect errors like this?

What is an error? Is a business logic bug going to be picked up by this process automatically, or are some manual steps involved?

For example, a point-of-sale app releases an update that automatically halves the amount to charge, but displays the full amount to the merchant in the UI. Unit tests pass (because an engineer made a human mistake). Backend calls are correctly used, no errors are thrown, simply the wrong amount is used.

How would this be automatically detected and reverted?

Would anyone writing point of sale software want to risk this over one of the biggest trading periods of the year?

As you point out, it really depends on what counts as an error. Most of the companies I know of that have a holiday freeze are video game companies, casual ones, even. Changes are minor fixes and optimizations: glitches that a player likely won’t notice, but you want to detect them early to avoid losing your ability to detect more.

Back-end tools are different, and I definitely see reasons other than bugs to not change business logic this month.

Yeah, that model may work for many public facing apps, but probably less so for enterprise systems that are heavy in business logic.

> it also betrays that your system is brittle and can be broken by a bad commit.

Correct. So's yours. So's everyone's. You might not know what the bad commit is, you might've fixed a bunch of the other bad commits, but even Google gets taken down by bad commits. Your system is brittle and can be broken by a bad commit.