Hacker News new | ask | show | jobs
by williamdclt 480 days ago
It's a decent rule of thumb, but it definitely needs some pragmatism. Squashing any error, strangeness and warning can be very expensive in some projects, much more than paying the occasional seemingly-unrelated problem.

But of course it's quasi-impossible to know in advance the likelihood of a given error participating in a future problem, and whether it's cheaper to fix this error ahead or let the problem happen. So it becomes an art more than a science to decide what to focus on.

"fix nothing" is certainly a horrible approach, "fix everything" is often impractical. So you either need some sort of decision framework, or a combination of decent "instinct" (experience, really) and trust from your stakeholder (which comes from many places, including good communication and track record of being pragmatic over dogmatic)

4 comments

> Squashing any error, strangeness and warning can be very expensive in some projects

Strongly disagreed. Strange, unexpected behaviour of code is a warning sign that you have fallen short in defensive programming and you no longer have a mental model of your code that corresponds with reality. That is a very dangerous to be in. Very quickly possible to be stuck in quicksand not too far afterwards.

Depends a lot on the project, I think, as the parent comment suggests.
I feel like these categories are different. Warnings should generally be treated as errors in my book, and all errors should be corrected. But "strangeness" is much more open ended. Sometimes large systems don't behave quite as expected and it might make sense to delay a "fix" until something is actually in need of it. If none of your tests fail then does it really matter?
> If none of your tests fail then does it really matter?

Yes. Absolutely.

You don't believe your software is correct because your tests don't fail. You believe your software is correct because you have a mental model of your code. If your tests are not failing but your software is not behaving correctly, that your mental model of your code is broken.

I agree for small systems. But as they get larger you often can't keep track of every last piece simultaneously. It can also become quite involved to figure out why a relatively obscure thing happened in a particular case.

Consider something like Unreal Engine for example. It's not realistic to expect to have a full mental image of the entire system in such a case.

At least in theory the tests are supposed to cover the observable behavior that matters. So I figure if the tests pass all is well. If I still find something broken then I need to add a test case for it.

> But as they get larger you often can't keep track of every last piece simultaneously

Sure, but then you divide the larger system into smaller components where each team is responsible for one or few of these individual pieces and the chief architect is responsible for making sure of how the pieces are put together.

> At least in theory the tests are supposed to cover the observable behavior that matters. So I figure if the tests pass all is well. If I still find something broken then I need to add a test case for it.

But you sure as hell hope that the engineer working on implementing your database has a decent mental model for the thread safety of his code and not introduces subtle concurrency bugs because his tests are still green. You also hope that he understands that he needs to call fsync to actually flush to data to disk instead of going yolo (systems never crash and disks never fail ). How are you supposed to cover the user observable behavior in this case? You cut off the power supply to your system/plug off your disk while writing to the database and assert that all the statements that got committed actually persisted? And how many times you repeat that test to really convince you that you are not leaving behind a bug that will only happen in production systems say once every three years?

I am only giving database and multithreading as examples because they are the most obvious, but I think the principle applies more generally. Take the simplest piece of code everyone learns to write first thing in uni, quicksort. If you don't have a sufficient mental model for how that algorithm works, what amount of tests will you write to convince yourself that your implementation is correct?

> then you divide the larger system into smaller components where each team is responsible for one or few of these individual pieces and the chief architect is responsible for making sure of how the pieces are put together

And then you have ravioli code in the large. It is not going to make it easier to understand the bigger system, but it will make it harder to debug.

https://en.wikipedia.org/wiki/Spaghetti_code#Ravioli_code

If you can reproduce an error, you can fix it. Do that.

If you cannot reproduce it after a day of trying and it doesn’t happen often, don’t fix it.

Probably not the best examples.

Sqlite famously has more test related code than database code.

Multithreading correctly is difficult enough that multiple specialized modeling languages exist and those are cumbersome enough to use that most people don't. In practice you avoid bugs there by strictly adhering to certain practices regarding how you structure your code.

You mention fsync but that famously does not always behave as you expect it to. Look up fsyncgate just for starters. It doesn't matter how well you think you understand your code if you have faulty assumptions about the underlying system.

Generally you come across to me as overconfident in your ability to get things right by carefully reasoning about them. Of course it's important to do that but I guarantee things are still going to break. You will never have a perfect understanding of any moderately large system. If you believe you do then you are simply naive. Plan accordingly (by which I mean, write tests).

I highly doubt anyone has a mental model of the all the code they're working with. You very often work with code that you kind of understand but not fully.
I obviously meant the code that you own and are responsible for.
Same thing there.
And/or so are your tests.
No, they are not. They are a cheap way of verifying that something hasn't gone wrong, not a proof of correctness.

Tests failing implies the code is incorrect. Tests not failing does not imply that the code is correct.

> Tests not failing does not imply that the code is correct.

I don't think that's what's being suggested. Tests not failing when your code does implies that you are missing test cases. In other words things are underspecified.

Haskell is the extreme example of this. If it successfully compiles then it most likely does exactly what you intended but it might be difficult to get it to compile in the first place.

I'm not saying that passing tests proves the code is correct, I'm saying that if you find a problem with the code that your tests don't pick up, then you should add a test for it.
100% this. If our product does something unexpected, finding out why is top priority. It might be that everything’s fine and this is just a rare edge case where this is the correct behaviour. It might be a silly display bug. Or it might be the first clue about a serious underlying issue.
I mean, it is expensive. It’s just that the alternative might be more so.
> might be

Yes.

Or it might be cheaper.

I remember a project to get us down from like 1500 build warnings to sub 100. It took a long time, generated plenty of bikeshedding, and was impossible to demonstrate value.

I, personally, was mostly just pissed we didn't get it to zero. Unsurprisingly the number has climbed back up since

Could you propose to fail the build based on the number of warnings to ensure it doesn't go up?

I did something similar with spotbugs. There were existing warnings I couldn't get time to fix so I configured the maven to fail if it exceed the level at which I enabled it.

This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.

> This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.

Our tests are often written with a list of known exceptions. However, such tests also fail if an exception is no longer needed - with a congratulatory message and a notice that this exception should be removed from the list. This ensures that the list gets shorter and shorter.

I did this with a project that was worked on by a large team (50+ devs, many, many, many kloc) when we first added linting to the project (this was very early 2000s) - we automatically tracked the number of errors and warnings at each build, persisted them, and then failed the build if the numbers went up. So it automatically adjusted the threshold.

It worked really well to incrementally improve things without forcing people to deal with them all the time. People would from time to time make sure they cleaned up a number of issues in the files they happened to be working on, but they didn't have to do them all (which can be a problem with strategies that for example lint only the changed files, but require 0 errors). We threw a small line chart up one one of our dashboards to provide some sense of overall progress. I think we got it down to zero or thereabouts within a year or so.

If you can instead construct a list of existing instances to grandfather in, that doesn't suffer from this problem. Of course many linting tools do this via "ignore" code comments.

That feels less arbitrary than a magic number (because it is!) and I've seen it work.

We used this approach to great effect when we migrated a huge legacy project from Javascript to Typescript. It gives you enough flexibility in the in between stages so you're not forced to change weird code you don't know right away, while enforcing enough of a structure to eventually make it out alive in the end.
Absolutely could!

However, management felt kinda burned because that was a bunch of time and unsurprisingly nobody was measurably more productive afterwards (it turns out those are just shitty code tidbits, but not highly correlated with areas which where it is miserable to make changes. Some of the over-refactorings probably made things harder.

It was a lovely measurable metric, making it an easy sell in advance. Which maybe was the problem idk.

It depends on the language. It's so easy to generate warnings in C/C++, other languages they are rare or it's easy to avoid them.
I guess I've been lucky to work on companies in the past ten or so years, where compiler warnings were treated as errors in the CI. Also been really lucky to use an editor which always highlights warnings and lists them, so I can fix those easily.
Step 1: get it to zero

Step 2: -Werror

Step 3: there is no step 3

Fixing everything is impractical, but I'd say a safer rule of thumb would be to at least understand small strangenesses/errors. In the case of things that are hard to fix - e.g. design/architectural decisions that lead to certain performance issues or what have you - it's still usually not too time consuming to get a basic understanding of why something is happening.

Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.

Sometimes it can take a serious effort to understand why a problem is happening and I'll accept an unknown blip that can be corrected by occasionally hitting a reset button occasionally when dealing with third party software. From my experience my opinion aligns with yours though - it's also worth understanding why an error happens in something you've written, the times we've delayed dealing with mysterious errors that nobody in the team can ascribe we've ended up with a much larger problem when we've finally found the resources to deal with it.

Nobody wants to triage an issue for eight weeks, but one thing to keep in mind is that the more difficult it is to triage an issue the more ignorance about the system that process is revealing - if your most knowledgeable team members are unable to even triage an issue in a modest amount of time it reveals that your most knowledgeable team members have large knowledge gaps when it comes to comprehending your system.

This, at least, goes for a vague comprehension of the cause - there are times you'll know approximately what's going wrong but may get a question from the executive suite about the problem (i.e. "Precisely how many users were affected by the outage that caused us to lose our access_log") that might take weeks or months or be genuinely nigh-on-impossible to answer - I don't count questions like that as part of issue diagnosis. And if it's a futile question you should be highly defensive about developer time.

That's very fair - at least with third party software, it can be nigh impossible to track down a problem.

With third party libraries, I've too-often found myself reading the code to figure something out, although that's a frustrating enough experience I generally wouldn't wish on other people.

This. Understand it all least to a level where you can make an effort vs risk/impact trade off. Ideally eliminate all low effort issues and mitigate high risk or high impact issues. But eliminating them all is not usually practical. And besides, most of the high impact/high effort application risk resides in design and not in warnings that come from logs or the compiler.
if there are any warnings I'm supposed to ignore then there are effectively no warnings.

there's nothing pagmatic about it. once I get into the habit of ignoring a few warnings that effectively means all warnings will be ignored