	by stephengillie 4447 days ago
	As a DevOps engineer, I can't justify building any automated way to shut down or restart all of my systems at once. We've only had to do that to resolve router reconvergence storms when changing out (relatively) major infrastructure pieces, such as our Juniper router.

3 comments

michaelt 4447 days ago

You don't intentionally build an automated way to take down all your servers at once.

You build a way to automatically perform some mundane standard procedure, like propagating a new firewall rule to all your systems at once. Then you accidentally propagate a rule that blocks all inbound ports. Huh, when I tested locally I didn't notice that.

Or you build a way to automatically delete timestamped log files more than a month old. And when it runs in production, it also deletes critical libraries which have the build timestamp in their filename. Ah, the test server was running a nightly build instead of a release so the files were named differently.

Or you build a way to automatically deploy the post-heartbleed replacement certificates to all your TLS servers, and only after you do that you find you didn't deploy the replacement corporate CA certificate to all the clients. Hmm, the test environment has a different CA arrangement, so testers don't get the private keys of prod certificates.

Or you build a way to retain timestamped snapshots of all your files, every five minutes, so you can roll back anything - then find that the huge, constantly changing log file gets snapshotted every time, and everything is hanging because of lack of disk space. Oh, production does get a lot more traffic to log, now I think about it.

Or you do any of a hundred other things that seem like simple, low risk operations until you realise they aren't.
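
The log cleanup case above is one where the blast radius can at least be narrowed: match one explicit filename pattern rather than "anything with a timestamp in it", and default to a dry run. A rough sketch, not anyone's actual tooling; the `app-YYYY-MM-DD.log` naming scheme, the `clean_old_logs` helper and the `DRY_RUN` variable are all assumptions made up for illustration, and `-maxdepth` assumes GNU, BSD or BusyBox `find`:

```shell
#!/bin/sh
# Hypothetical helper, not from the thread. Assumes logs are named
# app-YYYY-MM-DD.log; files with any other name are never touched.
clean_old_logs() {
    log_dir=$1
    max_age_days=${2:-30}
    pattern='app-[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].log'

    if [ "${DRY_RUN:-1}" = "0" ]; then
        # Only delete when DRY_RUN=0 is set explicitly.
        find "$log_dir" -maxdepth 1 -type f -name "$pattern" \
            -mtime +"$max_age_days" -exec rm -- {} +
    else
        # Default: print what would be deleted and change nothing.
        find "$log_dir" -maxdepth 1 -type f -name "$pattern" \
            -mtime +"$max_age_days" -print
    fi
}
```

Running `clean_old_logs /some/log/dir 30` only lists candidates; a library with a build timestamp in its name would not match the pattern. This does nothing for the underlying problem of test and production differing, it just makes the first production run less exciting.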

codexon 4447 days ago

Once I typed

  rm -rf logs_ *

instead of

  rm -rf logs_*
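
The difference is in how the shell splits words before expanding globs: with the stray space, `logs_` and a bare `*` become two separate arguments. A minimal sketch that makes this visible without deleting anything real, assuming a scratch directory and made-up file names; `show_args` is an illustrative helper, not a standard command:

```shell
#!/bin/sh
# Scratch directory with hypothetical file names; nothing real is touched.
scratch=$(mktemp -d) || exit 1
touch "$scratch/logs_1" "$scratch/logs_2" "$scratch/important.db"

# Illustrative helper: print each argument it receives on its own line.
show_args() {
    for arg in "$@"; do
        printf '%s\n' "$arg"
    done
}

# Intended: the glob expands to just the two log files.
intended=$(cd "$scratch" && show_args logs_*)

# Stray space: a literal "logs_" plus everything a bare * matches.
typo=$(cd "$scratch" && show_args logs_ *)

printf 'logs_* expands to:\n%s\n\n' "$intended"
printf 'logs_ * expands to:\n%s\n' "$typo"

rm -rf "$scratch"
```

With `rm -rf` in place of `show_args`, the second form would remove `important.db` too, and `-f` suppresses the complaint about the nonexistent `logs_` that might otherwise have been a warning sign.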

linker3000 4447 days ago

Our less-than-savvy Financial Director took it upon himself to restore the bought ledger files from tape to a live system after a slight mishap. Unfortunately, the bought ledger files all started with a 'b' and he managed to restore them to the root of the *nix system instead of the right place, so he mv'd b* to the right location.

All was well until a scheduled maintenance restart a few weeks later and we (eventually) discovered that /boot and /bin were AWOL.

Edit: He had access to the root account to maintain the accounts app (not my call)

knodi 4447 days ago

I have nightmares about such things.

akerl_ 4447 days ago

Unfortunately, the same tools that allow someone to automate management of systems can easily become catastrophic.

As one of the other commenters noted, a ~20 character Salt command will do this. I doubt Joyent built a Big Red Button to take down a datacenter; I expect this will be a case of somebody missing an asterisk or omitting a crucial flag while trying to do their normal work.

tommu 4447 days ago

Sorry - are you telling us you had to reboot all nodes because you swapped a router out? Sounds like you need a network engineer.

tommu 4447 days ago

And I'm being downvoted for that? Seriously? In 13 years of networking I have never once had to reload a machine to help with OSPF or BGP convergence. Good networking architecture and planning should mitigate anything other than a couple of minutes' outage. No routing change should ever require a reload of a server or end node.

cpayne 4447 days ago

I believe you were downvoted not for what you said, but for the way you said it.

I've been downvoted several times for what I see as relatively minor remarks. The HN readers are a sensitive bunch...

stephengillie 4447 days ago

Those who are still posting on HN are orders of magnitude more sensitive than those who post on Imgur. The communities are similar in size, yet Imguraffes are much, much more accepting of my comments. What merits a handful of upvotes there brings a downvote or two on this site.

stephengillie 4447 days ago

You're assuming my management has been paying for good networking architecture for the past dozen years.

tommu 4447 days ago

I believe it. Networking is seen as a commodity now. It's transparent until it fails. There's a whole lot of technical debt lurking out there. I personally have seen the dark shadow of spanning tree suck the light out of DevOps engineers' eyes.
