by lifthrasiir 3973 days ago
This is not the most difficult bug I've ever encountered, but it is definitely one of the most interesting bugs.

I had encountered some seriously incorrect outputs from the application server. The output in question was a function of internal state and the current time (rounded to the hour; it was a kind of "hourly" display). The application server was set to log many input/output pairs, so I was able to identify a non-trivial number of such errors, but I was unable to determine the cause. Common causes like memory corruption, time zones (as the business logic heavily depended on the local time), NTP synchronization and even an interpreter bug were considered and then rejected. Finally, after two weeks or so, I tried to simulate the function with varying current time and fixed internal state, and surprisingly a portion (but not all) of the output from the past matched the observed output!
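A minimal sketch of that kind of replay, in Python. Everything here is hypothetical: `hourly_output` is a stand-in for the real business logic (which I am not reproducing), and `matching_offsets` just sweeps the clock by whole hours while holding the state fixed, reporting which shifts reproduce a logged output.

```python
from datetime import datetime, timedelta


def hourly_output(state, now):
    """Hypothetical stand-in for the real function: depends on state and the hour."""
    hour = now.replace(minute=0, second=0, microsecond=0)
    return state + "@" + hour.strftime("%Y-%m-%d %H")


def matching_offsets(state, logged_at, observed, max_hours=24):
    """Replay with fixed state and a shifted clock; return the shifts that match."""
    return [
        offset
        for offset in range(-max_hours, max_hours + 1)
        if hourly_output(state, logged_at + timedelta(hours=offset)) == observed
    ]


# Illustrative values only: the request was logged at 01:53 local time,
# but the output looks as if it had been computed nine hours earlier.
logged_at = datetime(2014, 5, 14, 1, 53)
observed = hourly_output("state", datetime(2014, 5, 13, 16, 53))
offsets = matching_offsets("state", logged_at, observed)
```

If the matching shifts cluster on a single value equal to the local UTC offset, that points at the time zone rather than at the state.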

It turned out that glibc `localtime` can misbehave by ignoring the local timezone (falling back to UTC) when it is unable to read `/etc/localtime`, and the Linux box the server was on had some issue reading that file (I never fully identified it; this read was probably the only disk I/O from that server anyway). In light of this finding I exhaustively and retrospectively inspected the past logs; the gross error rate turned out to be on the order of 10^-4 (!), and the way `localtime` was used meant that the error could only alter a portion of the output. Studying the glibc code revealed that setting the `TZ` environment variable would disable the UTC fallback, so I did so and the error was gone.
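A rough Python illustration of pinning the zone through `TZ` instead of relying on `/etc/localtime`. The `local_hour` helper is hypothetical, and the POSIX-style value `KST-9` (a zone nine hours ahead of UTC) is only an assumption so that the sketch does not depend on zoneinfo files; it is not necessarily the value used on the actual server. `time.tzset()` is Unix-only.

```python
import os
import time


def local_hour(timestamp, tz=None):
    """Return the local hour for a Unix timestamp, optionally pinning TZ first."""
    if tz is not None:
        os.environ["TZ"] = tz
        time.tzset()  # make the C library re-read TZ (Unix only)
    return time.localtime(timestamp).tm_hour


ts = 1_400_000_000  # 2014-05-13 16:53:20 UTC

utc_hour = local_hour(ts, "UTC0")      # what a silent UTC fallback would report
pinned_hour = local_hour(ts, "KST-9")  # zone pinned explicitly via TZ
```

The two hours differ by the zone offset, which is the kind of discrepancy an "hourly" display would show when the fallback kicks in. In practice `TZ` would be set once in the service's environment before the process starts, not toggled at runtime like this.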

Lesson: Learn your moving parts, even if you don't know them in advance.