Hacker News
by shiftpgdn 4400 days ago
Let this be a lesson to Linux admins. Re-alias shutdown -r now to something else on production servers. I once took down access to about 6000 servers because I got my SSH windows confused and ran the server decommissioning script on our jump box.
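
One way to read "re-alias shutdown" is a shell function that shadows the command and asks you to type the machine's hostname first. This is only a minimal sketch: `confirm_host` and the wrapper are hypothetical names made up for illustration, not an existing tool, and the wrapper assumes it is loaded from an interactive shell's startup file on the servers you want to protect.

```shell
# Hypothetical guard: require the operator to type this machine's hostname
# before a destructive command is allowed to run.
confirm_host() {
    expected=$(uname -n)
    printf 'About to run "%s" on %s. Type the hostname to confirm: ' "$*" "$expected" >&2
    IFS= read -r answer || return 1
    if [ "$answer" = "$expected" ]; then
        return 0
    fi
    echo "Hostname mismatch, aborting." >&2
    return 1
}

# Hypothetical wrapper: shadows `shutdown` for shells that load this file.
# `command shutdown` bypasses the function and runs the real binary.
shutdown() {
    confirm_host shutdown "$@" && command shutdown "$@"
}
```

Typing the hostname forces you to look at which box you are actually on, which is the check that gets skipped when SSH windows are mixed up. Scripts that call /sbin/shutdown by full path are not affected by the function.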
5 comments

At one point, I worked in a computer lab that was mostly Ultrix machines. The shutdown grace period was specified in minutes ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

Then we got an HP-UX machine in the lab. For some reason, the grace period on that system was in seconds ( http://www.polarhome.com/service/man/generic.php?qf=shutdown... )

System dax shutting down in 5 seconds.

Cheers for this! Would have saved me so much grief before. Now going around and installing it on the servers I manage (fortunately nothing mission critical, but many remote).

"when I got the SSH windows confused"

I've come close to that as well.

This reminds me of the paradox of being competent vs. being a beginner.

It also has parallels in a few things outside computing.

Beginners make different mistakes because they don't know enough to go quickly.

Once you are experienced you fly, similar to the way you sometimes drive in a trance without thinking.

With power tools I've seen this as well. You tend to take more chances the more experience you have (or even, in my case, getting cut with an X-Acto knife). Someone using a saw for the first time is going to go slowly and follow the directions (of course there are other types of safety mistakes they could make).

While a newbie might do rm -fr directory * instead of rm -fr directory*, an experienced user could do that as well [1] simply by going too fast and not thinking "hey I'm doing something dangerous, let me slow down and check before I auto hit return".

[1] I typically do

for i in something*; do echo "$i"; done

Then if I like what I see I will up arrow and insert "rm -fr $i" after the echo. Or maybe a read x to pause in between.
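
Spelled out, the echo-first habit described above might look like the sketch below. The scratch directory and the file names are assumptions added so it is safe to run as-is; the two loops are the technique itself.

```shell
# Work in a scratch directory so the sketch is safe to run as-is.
scratch=$(mktemp -d)
touch "$scratch/something1" "$scratch/something2" "$scratch/keep"

# Step 1: dry run. Only print what the glob matches.
for i in "$scratch"/something*; do
    echo "$i"
done

# Step 2: the same loop, recalled with the up arrow, with rm added after the echo.
for i in "$scratch"/something*; do
    echo "$i"
    rm -fr -- "$i"
done
```

Quoting "$i" keeps names with spaces in one piece, and the -- stops a file name that starts with a dash from being read as an option.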

(Note: I'm not a sysadmin, but I've done sysadmin tasks for many years because it is kind of relaxing in a way.)

I once put `shutdown -h now` (halt) instead of `shutdown -r now` (reboot)

Once I realized what had happened on the production server I ended up calling OVH (they were helpful, but did not act immediately).

It's not a good feeling.

I tend to use /sbin/reboot instead, it amounts to the same (calls shutdown), but it's harder to get it mixed up.

This happened to me once; I don't know if this works on all Linux distros, but if you quickly follow a halt/shutdown with a "sudo init 6" (reboot) before your SSH session gets SIGTERMed/KILLed, the box comes back up. This at least worked on some Ubuntu version a few years back.

Give it a try on some system that's not critically important :)

Yeah, but the problem is when you honestly don't realize you called a halting shutdown until the server hasn't come back 5 minutes later and you review the terminal.

A similar case happened with the Eve Online cluster (~50,000 concurrent users) a couple of years ago. A programmer, who for some reason had access to the live cluster, confused his local development instance with that of the live cluster and issued a shutdown. Luckily they were able to avert the incoming disaster in time (it was a timed shutdown), but jokes are still made about the mistake.

http://oldforums.eveonline.com/?a=topic&threadID=1232785