Hacker News
by Forricide 479 days ago
Fixing everything is impractical, but I'd say a safer rule of thumb would be to at least understand small strangenesses/errors. In the case of things that are hard to fix - e.g. design/architectural decisions that lead to certain performance issues or what have you - it's still usually not too time consuming to get a basic understanding of why something is happening.

Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.

2 comments

Sometimes it can take a serious effort to understand why a problem is happening, and I'll accept an unknown blip that can be corrected by occasionally hitting a reset button when dealing with third party software. From my experience my opinion aligns with yours though - it's also worth understanding why an error happens in something you've written. The times we've delayed dealing with mysterious errors that nobody on the team can ascribe a cause to, we've ended up with a much larger problem when we've finally found the resources to deal with it.

Nobody wants to triage an issue for eight weeks, but one thing to keep in mind is that the more difficult it is to triage an issue the more ignorance about the system that process is revealing - if your most knowledgeable team members are unable to even triage an issue in a modest amount of time it reveals that your most knowledgeable team members have large knowledge gaps when it comes to comprehending your system.

This, at least, goes for a vague comprehension of the cause - there are times you'll know approximately what's going wrong but may get a question from the executive suite about the problem (e.g. "Precisely how many users were affected by the outage that caused us to lose our access_log") that might take weeks or months to answer, or be genuinely nigh-on-impossible to answer - I don't count questions like that as part of issue diagnosis. And if it's a futile question you should be highly defensive about developer time.

That's very fair - at least with third party software, it can be nigh impossible to track down a problem.

With third party libraries, I've too often found myself reading the code to figure something out, although that's a frustrating enough experience that I generally wouldn't wish it on other people.

This. Understand it at least to a level where you can make an effort vs risk/impact trade-off. Ideally eliminate all low effort issues and mitigate high risk or high impact issues. But eliminating them all is not usually practical. And besides, most of the high impact/high effort application risk resides in design and not in warnings that come from logs or the compiler.