by sporkenfang 3013 days ago
There are two things I don't like here:

- the author seems to call out specific individuals, by name, to their team leads, based on where they were ssh'd in, instead of bringing up the issue with that person first and asking for the logic of why they were on the box doing the thing (sounds like a lot of assumptions combined with finger pointing)

- it sounds like there's no well-configured monitoring or observability at the DC/rack/machine level involved, at all, which is surprising in a modern enterprise setup


> the author seems to call out specific individuals, by name, to their team leads, based on where they were ssh'd in, instead of bringing up the issue with that person first and asking for the logic of why they were on the box doing the thing (sounds like a lot of assumptions combined with finger pointing)

I'm not sure why you would assume that. She specifically says "The next thing I'd do is to go get in touch with that person."

Perhaps you're keying off "I've been able to track down some well-meaning but ultimately flawed attempts at fixing things that then blew up and became something much bigger. The folks who I pinged about it were amazed that I somehow had managed to "guess" that a specific member of their team had been poking at a specific box"? But keep in mind, those are specifically events that became large issues. Is it not appropriate to notify management of the cause of the issue? Either it's a first-time mistake or something the person couldn't necessarily have known to look out for, in which case management should be lenient, or it's the latest in a string of incidents and management should possibly take some other action.

If nothing else, it allows management for that other team to say "hey, we don't need to be messing with this aspect of the server. Either contact the team whose responsibility it is and get them to do the work, or get them to sign off on it first."

The on-call person was on a plane. They did something over intermittent connectivity and made a mistake. Another person on the ground was helping until the first one landed. I told #2 about a box touched by #1 and asked them to have a look.

They did and they figured it out. Outage resolved.

Why did you automatically assume I ratted them out to management? At no point does the story go there.

I’m really curious, since misunderstandings like this can really poison a working environment when people think you’re doing things you’re not. I want to know what sent you down the wrong path here.

Nah, you're fine, OP.

Commenters tend to proclaim bad intentions where none are present when they've either skimmed without reading or are lashing out to compensate for some weird insecurity, e.g. "I caused an outage once and I didn't want anyone to ask me and find out! How dare you want to know!?"

Thanks. I ask because I’m pretty sure I tripped over this big time in recent history.

When all you do is wander around looking for broken stuff to help fix, imagine the above sequence repeating itself.

For whatever reason,

> "I've been able to track down some well-meaning but ultimately flawed attempts at fixing things that then blew up and became something much bigger. The folks who I pinged about it were amazed that I somehow had managed to "guess" that a specific member of their team had been poking at a specific box"?

read like "I didn't like what someone did on a machine I had to troubleshoot, and told their manager" to me.

I was also reading this prior to coffee, mea culpa.

After running w, “The next thing I'd do is to go get in touch with that person. It would be foolish to continue when the answer might be a few short chat messages away.”