
As someone on the other side of this, I'm sympathetic and genuinely do try to debug problems, but, off the top of my head,

a) I don't actually have the same level of access to our cluster as our users do. There are datasets and even programs with contractual limitations on who can access them. So if you tell me "My job isn't working," I can't run it myself and see what's wrong; you need to send me the error message. Just like with software, if you can get me a minimal, self-contained example (especially one I can run myself), I can try to figure out why it's breaking, but I can't necessarily minimize your code for you.

b) Somewhat by definition (a system with "sysadmins" necessarily has enough users to justify paying us), there are a whole lot of other users who don't have whatever problem you have. (We notice very quickly if a problem is affecting everyone.) So chances are high that the answer is "You're holding it wrong" instead of "The tool is broken." Yes, a lot of the time the underlying cause is bad documentation or bad error messages, which we can and should fix, but in practice the common answer to those questions is that your teammate shows you how to hold the tool. The point of a sysadmin is to take advantage of economies of scale; it doesn't scale for us to debug everyone's problems. (And there's a very real sense in which time spent helping an individual user is time not spent writing docs or improving error messages.)

I think these problems ought to be solvable, and I'm curious what we (culturally) can do to make this better.

At a somewhat deeper technical level, I've been wondering about the nature of errors. Some errors - e.g., statting a file that doesn't exist - are fairly common in working software. Others - e.g., statting a file that you don't have permission to access - ought to be pretty rare. Suppose we had a kernel that could distinguish those, somehow, and sample backtraces or error contexts in some fashion. Would that help us identify problems like this faster, and home in more quickly on the cases where the system actually isn't working right?
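
To make the idea concrete, here is a rough userspace sketch in Python (not the kernel mechanism I'm imagining, just an illustration of the distinction). It treats ENOENT from stat as routine and EACCES/EPERM as suspicious, and keeps a bounded sample of call stacks for the suspicious ones. The split into "routine" and "suspicious", and the names `classify`, `checked_stat`, `error_counts`, and `suspicious_samples`, are all my own assumptions for the example.

```python
import errno
import os
import traceback
from collections import Counter

# Assumption: which errno values count as "routine" vs "suspicious" is a
# judgment call made for this sketch, not an established convention.
ROUTINE_ERRNOS = {errno.ENOENT}
SUSPICIOUS_ERRNOS = {errno.EACCES, errno.EPERM}

error_counts = Counter()   # how often each class of error was seen
suspicious_samples = []    # bounded sample of contexts for rare errors


def classify(err):
    """Map an OSError to 'routine', 'suspicious', or 'other'."""
    if err.errno in ROUTINE_ERRNOS:
        return "routine"
    if err.errno in SUSPICIOUS_ERRNOS:
        return "suspicious"
    return "other"


def checked_stat(path, max_samples=10):
    """stat() a path; on failure, count the error and maybe sample its context."""
    try:
        return os.stat(path)
    except OSError as err:
        kind = classify(err)
        error_counts[kind] += 1
        # Only keep backtraces for the errors that ought to be rare.
        if kind == "suspicious" and len(suspicious_samples) < max_samples:
            suspicious_samples.append({
                "path": path,
                "errno": errno.errorcode.get(err.errno, str(err.errno)),
                "stack": traceback.format_stack(),
            })
        return None
```

A job wrapper could call `checked_stat` in place of `os.stat` and attach `error_counts` and `suspicious_samples` to a support request, which would at least get the error context to someone who can't run the job themselves.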