| Earlier this week I was dealing with a hard problem: I'm pretty sure a thread was crashing without logging anything while the program stayed running. Unfortunately, it has only been seen in production, and only once every few days. The program does several stages of data processing in parallel batches, initially loading and eventually saving to a database. It's basically a "continuous" and complicated ETL. There is effectively a set of global state variables that track the progress of each input item through the stages. The values in this global state can depend on the data and on execution order, and can be modified from a dozen places in the code. I narrowed it down to several potential crash points, which were basically things like: if the global state contains x and a db lookup in thread 2 times out, then thread 3 could get a null reference if it accesses the value before thread 2 starts the next batch. Another was based on the decision to insert or update: in theory, the two global state values that effectively make this decision can never be set to states where it would do the wrong thing (getting either a foreign key or duplicate key error), but such a state is possible to represent. If I ran the program in a debugger against the massive production data stream, I might eventually get lucky and see the data that triggers this. However, I could also sit for days and get nowhere, or the act of debugging and inspecting might be enough to prevent a race condition and never trigger the bug. I still don't know for sure what's happening (though there's now instrumentation and better error handling in those spots, so hopefully I will), but the point is that it's nearly impossible to reason about in a definitive way. |
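To make the insert-or-update point concrete, here is a rough Python sketch. It is not the real code: the flag names `parent_saved` and `row_exists`, the `WriteAction` enum, and `InconsistentStateError` are all made up for illustration, and I'm assuming the decision hangs on two booleans. It shows that two flags can represent four combinations even if only two are supposed to occur, and that a check at the decision point turns the "impossible" ones into a loud error instead of a key error from the database.

```python
from enum import Enum
from itertools import product


class WriteAction(Enum):
    INSERT = "insert"
    UPDATE = "update"


class InconsistentStateError(Exception):
    """Raised when the flags hold a combination that should never occur."""


def decide_write(parent_saved: bool, row_exists: bool) -> WriteAction:
    # Both flag names are hypothetical stand-ins for the two global values.
    if not parent_saved:
        # Writing now would hit a foreign key error, so fail loudly here.
        raise InconsistentStateError(
            f"parent_saved={parent_saved}, row_exists={row_exists}"
        )
    # An existing row must be updated; inserting it again would duplicate the key.
    return WriteAction.UPDATE if row_exists else WriteAction.INSERT


def classify_all_states() -> dict:
    """Map every representable flag combination to an action name or 'invalid'."""
    outcomes = {}
    for parent_saved, row_exists in product([False, True], repeat=2):
        try:
            action = decide_write(parent_saved, row_exists)
            outcomes[(parent_saved, row_exists)] = action.name
        except InconsistentStateError:
            outcomes[(parent_saved, row_exists)] = "invalid"
    return outcomes
```

Calling `classify_all_states()` walks all four combinations, which is trivial with two booleans. With a dozen writers and state that depends on data and timing, it is exactly the enumeration you can't do by hand.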