by ewokhead 4927 days ago
Shipping code changes is a different beast.

I am talking about the infrastructure side of things.

I have built large-scale percentage-deployment and slice-deployment scenarios (whatever you want to call them) like the ones you speak of, but modifying an AGG switch that provides connectivity to your entire prod space... Uh... Go ahead and use your philosophy for managing large infrastructure, and I will enjoy my days off, thanks.

This change is not a SHIP IT! change. This is a switching infrastructure upgrade. This is not a push from your CI into your rolling release system that updates prod applications.

This is an underlying infrastructure change with high impact and high visibility with many stakeholders at risk.

Sorry for any confusion that my very vague post caused.

Maybe someday I will become a fisherman. But for now, I will keep these switches and servers up and running with 99.999% uptime. It is what I love to do!

Got any fishing tips?

1 comments

I guess we have differing viewpoints. I see absolutely no fundamental reason why infrastructure should be treated all that much differently from code. It should be possible to fire off a test suite, to automate its deployment, etc.
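To make "fire off a test suite" a bit more concrete, here is a minimal sketch of the kind of check I have in mind: after a change, probe a list of endpoints and report which ones are no longer reachable. The function names and the target list are hypothetical, and a real suite for something like a switch upgrade would need far more than a TCP connect (routing, VLANs, partial failures), so treat this as an illustration only.

```python
import socket


def port_is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def failed_checks(targets, probe=port_is_reachable):
    """Return the (host, port) targets that the probe could not reach.

    `probe` is injectable so the logic can be exercised without a real network.
    """
    return [(host, port) for host, port in targets if not probe(host, port)]
```

You would run `failed_checks` against the same target list before and after the change window and compare the results; an empty list afterwards is the minimum bar, not proof that everything is healthy.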

I would agree that is not where most folks are at today.

I would argue the far more interesting discussion is how we develop and mature tools to get more folks there in future.

Since not everyone here is ops, if your holiday is going to be potentially impacted by a deployment, you are fully aware of that going into the deployment. We take note of people with blacked out dates (e.g. you booked your flight before we ever started talking about this), and everyone else impacted knows what's on the docket. While the issues are sudden, everyone at least has that nagging feeling that they might get a call to action.

I agree that we should be moving to automated infrastructure testing and the like. To some extent, it may be possible via Puppet, Chef, or similar automation tools; however, not all infrastructure is like that. Sometimes you have to go physically move stuff during your downtime window, and you can't do redundant wiring (particularly for the network). I've been bitten by network outages more than anything else, particularly ones with partial or undetected failures.

I think we're seeing a move toward the "treat infrastructure as code" future, for example with clustered fileservers (NetApp 8 Cluster-Mode, or Isilon systems). You'll be able to "seamlessly" migrate data and virtual interfaces around without impacting production. I'm looking forward to seeing how that changes ops.