by nixgeek 4920 days ago
Holidays are actually one of the best times to be making changes as traffic is significantly lower, and IMO, one should be aiming for an infrastructure where you can always ship changes without being afraid of the ramifications.

Architecturally that may mean many things - hitting "SHIP IT!" might push code into a staging environment for some final testing before delivering it on a platter to production. Should you have multiple sites, it might involve rolling out the new stuff to just one of them until you see how it goes. Or maybe you have feature flags and want to introduce a new change to all servers, but expose it to just 1% of the user population.
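To make the "all servers, but just 1% of users" idea concrete, here is a minimal sketch of a percentage-based feature flag. It assumes users are identified by a string ID and that a stable hash is an acceptable way to pick the cohort; the function name `in_rollout` and the flag name are made up for illustration and don't come from any particular flag library.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage for the flag.

    Hypothetical helper: hashes flag + user so the same user always gets the
    same answer, and different flags pick different 1% cohorts.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Every server runs the same code; only ~1% of users take the new path.
if in_rollout("new-checkout", "user-42", 1.0):
    pass  # new behaviour would go here
else:
    pass  # existing behaviour
```

Because the bucket depends only on the flag name and user ID, raising `percent` from 1 to 10 keeps the original 1% enabled and just adds more users on top.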

Fundamentally hitting "SHIP IT!" should be doing just that. Any constraints you put on how fast it gets to 100% of the user population are a risk control, and you need to optimize for a balance of developer happiness and system stability.
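As a rough sketch of what "constraints on how fast it gets to 100%" can look like as a risk control: walk through a list of rollout stages and stop at the last one that stayed healthy. The stage list, the error-rate threshold, and the `error_rate_at` callback are all assumptions for illustration; in practice that callback would be whatever your monitoring gives you.

```python
from typing import Callable, Sequence


def ramp(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> int:
    """Walk the rollout stages in order; return the last percentage that stayed healthy.

    Returns 0 if even the first stage exceeded the error budget.
    """
    reached = 0
    for percent in stages:
        # error_rate_at is a hypothetical hook: expose the change to `percent`
        # of users, wait, and report the observed error rate.
        if error_rate_at(percent) > max_error_rate:
            break  # stop the ramp; stay at the last healthy stage
        reached = percent
    return reached


# Pretend the change starts failing once half the users see it.
observed = lambda percent: 0.05 if percent >= 50 else 0.001
print(ramp([1, 10, 50, 100], observed, max_error_rate=0.01))
```

Fewer, larger stages favour developer happiness; more, smaller stages favour stability. That trade-off is the dial being tuned here.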

When you concede "We can't make changes because we're frozen" outside of a critical systems ('life critical') environment, you should quit your IT job and go become a fisherman or something.

2 comments

Shipping code changes is a different beast.

I am talking about the infrastructure side of things.

I have built large-scale percentage-deployment and slice-deployment scenarios (whatever you want to call them) like the ones you speak of, but modifying an AGG switch that provides connectivity to your entire prod space... Uh.. Go ahead and use your philosophy for managing large infrastructure and I will enjoy my days off, thanks.

This change is not a SHIP IT! change. This is a switching infrastructure upgrade. This is not a push from your CI into your rolling release system that updates prod applications.

This is an underlying infrastructure change with high impact and high visibility with many stakeholders at risk.

Sorry for any confusion that my very vague post caused.

Maybe someday I will become a fisherman. But for now, I will keep these switches and servers up and running with 99.999% uptime. It is what I love to do!

Got any fishing tips?

I guess we have differing viewpoints. I see absolutely no fundamental reason why infrastructure should be treated all that much differently to code. It should be possible to fire off a test suite, to automate its deployment, etc.

I would agree that is not where most folks are at today.

I would argue the far more interesting discussion is how we develop and mature tools to get more folks there in future.

Since not everyone here is ops, if your holiday is going to be potentially impacted by a deployment, you are fully aware of that going into the deployment. We take note of people with blacked out dates (e.g. you booked your flight before we ever started talking about this), and everyone else impacted knows what's on the docket. While the issues are sudden, everyone at least has that nagging feeling that they might get a call to action.

I agree that we should be moving to automated infrastructure testing and stuff like that. To some extent, it may be possible via Puppet/Chef/automation tools; however, not all infrastructure is like that. Sometimes you have to go physically move stuff during your downtime window, and you can't do redundant wiring (particularly for network). I've been bitten by network outages more than anything else, particularly with partial/undetected failures.

I think we're seeing a move to the "treat infrastructure as code" future, such as clustered fileservers (NetApp's Data ONTAP 8 Cluster-Mode, or Isilon systems). You'll be able to "seamlessly" migrate data and virtual interfaces around without impacting production. I'm looking forward to seeing how that changes ops.

I love doing major upgrades over Xmas/NY, Easter, and Labor Day. Lower traffic, and much better chances of fixing things if they go wrong.

I've always found that if you compensate people for it, it's not a problem. I personally would prefer 2-3 weeks extra vacation some other time in exchange for working through the holidays. It's way cheaper to travel in Jan/Feb, too.

It depends on your team, though -- if it's a bunch of people with kids who have school holidays at certain times, it might be more of an issue.