by Hankenstein2 1827 days ago
This was interesting but probably not the way the author intended. Lately I feel like I have been spending a significant part of my development time creating small scripts like this for the sole purpose of convincing sys-admins that the problem is actually theirs.

I absolutely believe that sys-admins are as stressed and stretched as thin as the rest of us, and that systems in general are worse off because of it, but I have always been fascinated/irritated by the assumption that sys-admins are right until proven wrong.


As someone on the other side of this, I'm sympathetic and genuinely do try to debug problems, but, off the top of my head,

a) I don't actually have the same level of access to our cluster as our users do. There are datasets and even programs with contractual limitations on who can access them. So if you tell me "My job isn't working," I can't run it myself and see what's wrong; you need to send me the error message. Just like with software, if you can get me a minimal, self-contained example (especially one I can run myself), I can try to figure out why it's breaking, but I can't necessarily minimize your code.

b) Somewhat by definition (a system with "sysadmins" necessarily has enough users to justify paying us), there are a whole lot of other users who don't have whatever problem you have. (We notice very quickly if a problem is affecting everyone.) So chances are high that the answer is "You're holding it wrong" instead of "The tool is broken." Yes, a lot of the time that's down to bad documentation or bad error messages, which we can and should fix, but in practice the common answer to those questions is that your teammate shows you how to hold the tool. The point of a sysadmin is to take advantage of economies of scale; it doesn't scale for us to debug everyone's problems. (And there's a very real sense in which time spent helping an individual user is time not spent writing docs or improving error messages.)

I think these problems ought to be solvable, and I'm curious what we (culturally) can do to make this better.

At a somewhat deeper technical level, I've been wondering about the nature of errors. Some errors - e.g., statting a file that doesn't exist - are fairly common in working software. Others - e.g., statting a file that you don't have permission to access - ought to be pretty rare. Suppose we had a kernel that could distinguish those, somehow, and sample backtraces or error contexts in some fashion. Would that help us identify problems like this faster, and home in more quickly on the fact that the system actually isn't working right?
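
To make that a bit more concrete, here is a rough userspace sketch of the idea in Python (not a kernel feature, and not something any existing system does as far as I know). It treats ENOENT from stat as routine and records a backtrace for anything else. The names `EXPECTED_ERRNOS`, `record_if_rare`, and `stat_with_triage` are made up for illustration, and which errno values count as "expected" is purely an assumption.

```python
import errno
import os
import traceback

# Assumption: only "file does not exist" is treated as a routine failure.
EXPECTED_ERRNOS = {errno.ENOENT}


def record_if_rare(exc, path, rare_log):
    """Append error context to rare_log unless the errno is an expected one."""
    if exc.errno in EXPECTED_ERRNOS:
        return False
    rare_log.append({
        "path": path,
        # Symbolic name such as "EACCES"; falls back to the raw number.
        "errno": errno.errorcode.get(exc.errno, str(exc.errno)),
        # Call stack at the point where the error was handled.
        "stack": traceback.format_stack(),
    })
    return True


def stat_with_triage(path, rare_log):
    """Stat a path; return the result, or None if the stat failed."""
    try:
        return os.stat(path)
    except OSError as exc:
        record_if_rare(exc, path, rare_log)
        return None
```

A caller would pass in a shared list (or something that ships records to a central collector) and look only at what lands there. A real version would presumably sample rather than record every rare error.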

All of those are great points and I agree. I just find myself, more often lately, exhaustively trying to prove my bug is real before something gets fixed.

I wish there were some sort of badge I could earn, like a "You have gotten 5 bugs fixed without being a dumbass" badge. And then my 6th one might get escalated earlier.

Like I said, I really appreciate both sides of the issue and am also not certain how to make it better.

The idea with badges is awesome! Kind of like a trusted traveler program, but for sysops.