Hacker News
by ewokhead 4917 days ago
Note to GitHub:

Freeze prod changes two weeks before and two weeks after all major holidays.

Your employees probably don't appreciate the hassle when all they are thinking about is "YEAH! DAYS OFF!"

Just my opinion and how I run my systems in the DC.

7 comments

Holidays are actually one of the best times to be making changes as traffic is significantly lower, and IMO, one should be aiming for an infrastructure where you can always ship changes without being afraid of the ramifications.

Architecturally that may mean many things. Hitting "SHIP IT!" might push code into a staging environment for some final testing before delivering it onto a platter in production. Should you have multiple sites, it might involve rolling out the new stuff to just one of them until you see how it goes. Maybe you have feature flags and want to introduce a new change to all servers, but to just 1% of the user population.
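To make the "1% of the user population" idea concrete, here is a minimal sketch of a percentage rollout behind a feature flag. It assumes users have a stable string ID; the names `rollout_bucket`, `is_enabled`, and the flag name are hypothetical, not any particular feature-flag library's API.

```python
import hashlib


def rollout_bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 10000) for a given flag.

    Hashing the flag name together with the user ID means a user who is
    in the first 1% for one flag is not automatically in it for all flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(user_id: str, flag_name: str, percent: float) -> bool:
    """Return True if the flag is on for this user at the given rollout percent.

    percent is in the range 0..100; 1.0 means roughly 1% of users.
    """
    threshold = round(percent * 100)  # 1.0% -> 100 buckets out of 10,000
    return rollout_bucket(user_id, flag_name) < threshold


if __name__ == "__main__":
    # Hypothetical flag name and user IDs, purely for illustration.
    users = [f"user-{n}" for n in range(10_000)]
    enabled = sum(is_enabled(u, "new-checkout", 1.0) for u in users)
    print(f"{enabled} of {len(users)} users see the change")
```

Because the bucket depends only on the user and the flag, raising the percentage only ever adds users; nobody who already has the change loses it mid-rollout.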

Fundamentally hitting "SHIP IT!" should be doing just that. Any constraints you put on how fast it gets to 100% of the user population are a risk control, and you need to optimize for a balance of developer happiness and system stability.

When you concede "We can't make changes because we're frozen" outside of a critical systems ('life critical') environment, you should quit your IT job and go become a fisherman or something.

Shipping code changes is a different beast.

I am talking about the infrastructure side of things.

I have built large-scale percentage deployments, slice deployments (whatever you want to call them) like the ones you describe, but modifying an AGG switch that provides connectivity to your entire prod space? Uh... Go ahead and use your philosophy for managing large infrastructure, and I will enjoy my days off, thanks.

This change is not a SHIP IT! change. This is a switching infrastructure upgrade. This is not a push from your CI into your rolling release system that updates prod applications.

This is an underlying infrastructure change with high impact and high visibility with many stakeholders at risk.

Sorry for any confusion that my very vague post caused.

Maybe someday I will become a fisherman. But for now, I will keep these switches and servers up and running with 99.999% uptime. It is what I love to do!
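For anyone wondering what that number leaves room for, here is a quick back-of-the-envelope sketch. The function name is made up, and it assumes a 365-day year with no excluded maintenance windows.

```python
def downtime_budget_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime allowed over `days` at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * 24 * 60


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

At 99.999% that works out to a little over five minutes a year, so a single botched switch change can spend the whole budget several times over.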

Got any fishing tips?

I guess we have differing viewpoints. I see absolutely no fundamental reason why infrastructure should be treated all that much differently to code. It should be possible to fire off a test suite, to automate its deployment, etc.

I would agree that is not where most folks are at today.

I would argue the far more interesting discussion is how we develop and mature tools to get more folks there in future.

Since not everyone here is ops, if your holiday is going to be potentially impacted by a deployment, you are fully aware of that going into the deployment. We take note of people with blacked out dates (e.g. you booked your flight before we ever started talking about this), and everyone else impacted knows what's on the docket. While the issues are sudden, everyone at least has that nagging feeling that they might get a call to action.

I agree that we should be moving to automated infrastructure testing and stuff like that. To some extent, it may be possible via puppet/chef/auto tools; however, not all infrastructure is like that. Sometimes you have to go physically move stuff during your downtime window, and you can't do redundant wiring (particularly for network). I've been bitten by network outages more than anything else, particularly with partial/undetected failures.

I think we're seeing a move to the "treat infrastructure as code" future, such as cluster fileservers (NetApp 8 Cluster-Mode, or Isilon systems). You'll be able to "seamlessly" migrate data and virtual interfaces around without impacting production. I'm looking forward to seeing how that changes ops.

I love doing major upgrades over Xmas/NY, Easter, and Labor Day. Lower traffic, and much better chances of fixing things if they go wrong.

I've always found that if you compensate people for it, it's not a problem. I personally would prefer 2-3 weeks extra vacation some other time in exchange for working through the holidays. It's way cheaper to travel in Jan/Feb, too.

It depends on your team, though -- if it's a bunch of people with kids who have school holidays at certain times, it might be more of an issue.

I guess one of the counter-arguments to this (very good) suggestion is that holidays are probably the quietest time in terms of traffic and usage.

It's not just the employees. If you provide a service that your customers depend on to provide their service, they're not going to appreciate the hassle on a holiday, especially when it's not at all their fault. This isn't true so much for GitHub, but is for infrastructure providers like AWS.

Tell that to Rackspace. Not sure how many were affected, but on Christmas Day they decided to update a router, causing a 3-1/2 hr outage from 8:30 EST to 11:00 EST approx. It took out part of my infrastructure and made for a nice Christmas morning surprise. It affected the ORD datacenter. The only thing I could find on RS status is https://status.rackspace.com/index/viewincidents?group=2&.... And of course the ticket in my account.

This is (or was) the MSN way. No releases over winter holidays.

A lot of people replying to this are talking about how holidays are their low period. I work in mobile; the Christmas holiday is the busiest time of the year for us because that's when everyone gets their new devices and downloads apps and has all day to play with them.

GitHub's a service provider and I would be surprised if none of their customers had peak traffic over Christmas for this or a similar reason.

In any case, I agree with nixgeek; you shouldn't ever expect outages when upgrading infrastructure. On the other hand, you also have to measure the pain of fixing an outage, and that's likely to be higher during the holidays because fewer people are available.

Stopping coders from deploying stuff due to a risk of an unspecified "something" going bad makes for frustrated employees. By that time you might as well give them the days off as they'll have little to strive toward. You don't need to freeze all code deployments and other things that have little risk.

Also, they most likely scheduled this at this time due to the lower traffic, probably the lowest of the year for them. While half of GitHub was probably enjoying their families, the other half had been planning for this for a long time. Given the size of the operation, I don't think anyone took it lightly.

Switch modifications should never stop work.

Prod code push != prod infrastructure changes, which is what the article is talking about. Specifically, the agg. switching layer.

My reply is not about code deployments. It is about managing network devices with high visibility and impact.

I still stand by my original comment with a critical detail added:

Freeze prod ~infrastructure~ changes two weeks prior and two weeks after major holidays.

Push code all you want.
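As a rough sketch of that rule, assuming a hand-maintained holiday list and a change "kind" label on each request (both hypothetical, as are the function names):

```python
from datetime import date, timedelta
from typing import Sequence

# Two weeks on either side of each holiday, per the rule above.
FREEZE_MARGIN = timedelta(weeks=2)

# Hypothetical holiday list; a real one would come from your own calendar.
HOLIDAYS = [date(2012, 12, 25), date(2013, 1, 1)]


def in_freeze_window(day: date, holidays: Sequence[date],
                     margin: timedelta = FREEZE_MARGIN) -> bool:
    """True if `day` is within `margin` before or after any holiday (inclusive)."""
    return any(h - margin <= day <= h + margin for h in holidays)


def change_allowed(day: date, kind: str, holidays: Sequence[date]) -> bool:
    """Infrastructure changes are frozen around holidays; code pushes are not."""
    if kind == "infrastructure":
        return not in_freeze_window(day, holidays)
    return True


if __name__ == "__main__":
    day = date(2012, 12, 22)
    for kind in ("infrastructure", "code"):
        print(kind, "allowed" if change_allowed(day, kind, HOLIDAYS) else "frozen")
```

Anything not labeled as an infrastructure change passes straight through, which matches the "push code all you want" part.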

The RFO that they provided addresses link aggregation changes which are a part of an infrastructure change.

One month of no deploys seems pretty high for an organization that does upwards of a hundred deploys a day. 2 days seems far more reasonable for the organizational philosophy they're going for.

Yep, it is a long time, which is why I mean to speak specifically about the infrastructure side of things. Modifying the agg layer providing connectivity for all prod systems during a holiday weekend would suck.

When I say prod, I mean prod infrastructure. Not code.

Sorry for the confusion.

Code changes are not the same as infrastructure changes.

Well, he did say "prod changes", which I would take to include all changes to the production environment, code and otherwise.