Hacker News
by rrr_oh_man 758 days ago
Can you elaborate?

Sounds like a hell of a story.

At a previous company, we were in the early stages of building a massively distributed simulation platform that would power MMOs and government/military simulations. The platform was written in Scala and used Akka extensively (because of reasons). We had a test environment that spun up a decently big game world, and had a bunch of bots run around and do things. It would run overnight.

At some point it was discovered that every once in a while, bots that were supposed to just go back and forth across the entire game world forever would get stuck. It was immediately obvious that they were getting stuck at machine boundaries (the big game world was split into a grid, and different machines would run the simulation for different parts of the grid). This suggested the bug was in the very non-trivial code that handled entity migration between machines.

This was a nightmare to debug. Distributed logging isn't fun. Bugs in distributed systems have a tendency to be heisenbugs. We could reproduce the bug more or less reliably, but sometimes it took hours of running the simulation until it manifested; worse, not manifesting for a few hours wasn't a clear signal that the bug had been fixed.

My investigations were broad and deep. I looked at the Kryo serialization protocols at the byte level. I scrutinized the Akka code we were using for messaging. I rewrote bits and pieces of the migration code in the hope it would fix the bug. Many other engineers also looked at all this and found nothing. A Principal Engineer became convinced this had to be a bug in Scala's implementation of Map. I was very close to giving up multiple times.

At some point there was a breakthrough -- another engineer discovered a workaround. A violent but effective one: flushing every cache and other bits of internal state except the ground truth would get the entities unstuck. We added a button to the debug world viewer appropriately labelled YOLO RESYNC. We were so desperate about this bug, we seriously discussed triggering a YOLO RESYNC periodically.
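
As a rough illustration of what a resync of this kind amounts to (this is not the original Scala code, and every name below is hypothetical): keep the authoritative state, throw away everything derived from it, and rebuild. The sketch assumes a single node whose ground truth is a map from entity id to grid cell, with one derived index standing in for "every cache and other bits of internal state".

```python
class Node:
    """Hypothetical simulation node: authoritative state plus one derived cache."""

    def __init__(self, entities):
        # Ground truth: entity id -> grid cell. Never touched by a resync.
        self.entities = dict(entities)
        # Derived cache: grid cell -> set of entity ids. Can go stale.
        self.cell_index = {}
        self.rebuild()

    def rebuild(self):
        """Recompute the derived index purely from the ground truth."""
        index = {}
        for entity_id, cell in self.entities.items():
            index.setdefault(cell, set()).add(entity_id)
        self.cell_index = index

    def yolo_resync(self):
        """Drop all derived state, then rebuild it from the ground truth."""
        self.cell_index.clear()
        self.rebuild()


node = Node({"bot-1": (0, 0), "bot-2": (1, 0)})
node.cell_index[(0, 0)].discard("bot-1")  # simulate a cache that went stale
node.yolo_resync()                        # the index matches the ground truth again
```

A resync like this hides the symptom without explaining it: whatever made the cache diverge from the ground truth is still there.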

But if YOLO RESYNC fixed the issue, it meant that there was some sort of problem with the state of the system. I spent some more days and weeks diffing the state before and after YOLO RESYNC (more difficult than it sounds in a not-entirely-deterministic distributed simulation) and narrowed it down more and more until I finally found a very subtle bug in our pubsub implementation. I don't remember exactly what the issue was, but there was some sort of optimization to prevent a message from being sent to a recipient under certain conditions that would "guarantee" the recipient would have gotten the message in some other way -- and the condition was very subtly buggy. Fixing it was a one- or two-line change.
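
For readers who want a picture of the before/after diffing step, here is a minimal sketch. It is not the tooling that was actually used. It assumes the state can be dumped as nested dictionaries with string keys, and that volatile fields (the hypothetical `tick` counter below) can simply be ignored to cope with non-determinism. The function name, the `ignore` parameter, and the snapshot layout are all made up for illustration.

```python
from typing import Any, Dict, FrozenSet, Tuple

MISSING = "<missing>"  # placeholder for a key present on only one side


def diff_snapshots(
    before: Dict[str, Any],
    after: Dict[str, Any],
    ignore: FrozenSet[str] = frozenset(),
    path: str = "",
) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted.path: (before_value, after_value)} for each differing leaf."""
    changes: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(before) | set(after)):
        if key in ignore:
            continue  # skip fields known to vary from run to run
        here = f"{path}.{key}" if path else key
        old = before.get(key, MISSING)
        new = after.get(key, MISSING)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested state so only the differing leaves are reported.
            changes.update(diff_snapshots(old, new, ignore, here))
        elif old != new:
            changes[here] = (old, new)
    return changes


# Hypothetical snapshots taken just before and just after a resync.
before = {"node-a": {"tick": 100, "subscribers": {"bot-7": ["node-a"]}}}
after = {"node-a": {"tick": 101, "subscribers": {"bot-7": ["node-a", "node-b"]}}}

for where, (old, new) in diff_snapshots(before, after, ignore=frozenset({"tick"})).items():
    print(where, old, "->", new)
```

In this made-up example the only surviving difference is a subscriber list, which is the kind of lead that would point at the pubsub layer rather than at serialization or migration.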

I still remember the JIRA ticket: ENG-168. It tested my sanity and my resilience for longer and harder than anything else before or after.

[EDIT] I saved this ticket as a PDF as a traumatic memory. It was in January 2015, so I got some details wrong, the main one being that it only took about two weeks between the bug report (Jan 28) and the fix (Feb 10). I swear it felt like 3 months.

That was even better than expected.

Thanks for sharing!