Hacker News
by nertirs 757 days ago
Bug resolution time depends on how familiar a developer is with the system, how complex the issue is and how impactful the bug is. Not everything can be solved in 24 hours. Not everything has to be solved in 24 hours.

Saying that your developers will solve every problem in 24 hours seems like a toxic PR move.

5 comments

Yes, that was my reaction as well. I'm still traumatized by an insidious bug in a distributed system that took me around 3 months of nearly exclusive work to diagnose and fix (one-line fix, of course). ENG-168. Never forget.

"How to fix bugs in 24h", LOL.

That is ... impressive! My longest was 3 weeks, working exclusively.

Turns out it was only 2 weeks, but it felt like 3 months :) See my other comment.

Can you elaborate?

Sounds like a hell of a story.

At a previous company, we were in the early stages of building a massively distributed simulation platform that would power MMOs and government/military simulations. The platform was written in Scala and used Akka extensively (because of reasons). We had a test environment that spun up a decently big game world, and had a bunch of bots run around and do things. It would run overnight.

At some point it was discovered that every once in a while, bots that were supposed to just go back and forth the entire game world forever would get stuck. It was immediately obvious that they were getting stuck at machine boundaries (the big game world was split into a grid, and different machines would run the simulation for different parts of the grid). This suggested the bug was in the very non-trivial code that handled entity migration between machines.

This was a nightmare to debug. Distributed logging isn't fun. Bugs in distributed systems have a tendency to be heisenbugs. We could reproduce the bug more or less reliably, but sometimes it took hours of running the simulation until it manifested; worse, not manifesting for a few hours wasn't a clear signal that the bug had been fixed.

My investigations were broad and deep. I looked at the Kryo serialization protocols at the byte level. I scrutinized the Akka code we were using for messaging. I rewrote bits and pieces of the migration code in the hope it would fix the bug. Many other engineers also looked at all this and found nothing. A Principal Engineer became convinced this had to be a bug in Scala's implementation of Map. I was very close to giving up multiple times.

At some point there was a breakthrough -- another engineer discovered a workaround. A violent but effective one: flushing every cache and other bits of internal state except the ground truth would get the entities unstuck. We added a button to the debug world viewer appropriately labelled YOLO RESYNC. We were so desperate about this bug, we seriously discussed triggering a YOLO RESYNC periodically.

But if YOLO RESYNC fixed the issue, it meant that there was some sort of problem with the state of the system. I spent some more days and weeks diffing the state before and after YOLO RESYNC (more difficult than it sounds in a not-entirely-deterministic distributed simulation) and narrowed it down more and more until I finally found a very subtle bug in our pubsub implementation. I don't remember exactly what the issue was, but there was some sort of optimization to prevent a message from being sent to a recipient under certain conditions that would "guarantee" the recipient would have gotten the message in some other way -- and the condition was very subtly buggy. Fixing it was a one- or two-line change.
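A rough illustration of the before/after diffing idea, not the original code (which was Scala and is not shown in this thread). The snapshot shape, the field names (`owner`, `subscribers`, `tick`), and the `diff_snapshots` helper are all hypothetical. The only point is that fields expected to change between runs have to be filtered out before the remaining differences mean anything.

```python
# Hypothetical sketch: compare two state snapshots taken before and after a
# full resync, ignoring fields that are expected to differ (e.g. a tick counter).
MISSING = "<missing>"


def diff_snapshots(before, after, ignore_fields=frozenset()):
    """Return {entity_id: {field: (old, new)}} for every field that changed."""
    changes = {}
    for entity_id in sorted(set(before) | set(after)):
        old = before.get(entity_id, {})
        new = after.get(entity_id, {})
        fields = (set(old) | set(new)) - set(ignore_fields)
        delta = {
            field: (old.get(field, MISSING), new.get(field, MISSING))
            for field in sorted(fields)
            if old.get(field, MISSING) != new.get(field, MISSING)
        }
        if delta:
            changes[entity_id] = delta
    return changes


# Made-up example data: bot-1 is missing a subscriber before the resync.
before = {
    "bot-1": {"owner": "node-a", "subscribers": ["node-a"], "tick": 100},
    "bot-2": {"owner": "node-b", "subscribers": ["node-b"], "tick": 100},
}
after = {
    "bot-1": {"owner": "node-a", "subscribers": ["node-a", "node-b"], "tick": 250},
    "bot-2": {"owner": "node-b", "subscribers": ["node-b"], "tick": 250},
}

result = diff_snapshots(before, after, ignore_fields=frozenset({"tick"}))
print(result)
```

In a simulation that is not fully deterministic, most of the effort goes into deciding what belongs in `ignore_fields`, which is why this is harder than it sounds.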

I still remember the JIRA ticket: ENG-168. It tested my sanity and my resilience for longer and harder than anything else before or after.

[EDIT] I saved this ticket as a PDF as a traumatic memory. It was in January 2015, so I got some details wrong, the main one being that it only took about two weeks between the bug report (Jan 28) and the fix (Feb 10). I swear it felt like 3 months.

That was even better than expected.

Thanks for sharing!

>Bug resolution time depends on how familiar a developer is with the system, how complex the issue is and how impactful the bug is

Some huge factors are:

- how fast you can test things (e.g. change and re-run test)

- how good your runtime (e.g. debugger) and logging visibility are

- how good (representative) the test data is

In other words, fast ways to "look into" the matter. Usually if you can iterate fast and get a reproduction, you can also solve it fast.

It's much less common that the issue is a big architectural concern that needs total rethinking/refactoring.

Biggest factor, dwarfing all the other factors: can you quickly repro.

Well you could at least start with solving everything in 24 hours that can be solved in 24 hours. More often than not such bugs take days, weeks, months not because of time in IDE, but because of backlogs, prioritization, time in test and longer release cycles. Streamlining that sounds mostly a win to me.

Note that a bug fixed in 24hrs is also a bug that doesn't have to be fixed later. I mean the development work has to be done at some point anyway, and this may even save some time discussing and bouncing around the issue.

Presumptuous of you to assume that my team does not already work like that.

Speaking as one of those developers, we suggested the topic. We are proud of this!

I've worked at several companies you know (https://www.linkedin.com/in/macneale) - and this is the least toxic company I have ever worked at. Hands down. We take pride in running a tight ship.

Cloudflare have been fixing their billing oopsie since Mar 21st.

Sometimes bugs aren't so much a bug as the anthill.