Hacker News new | ask | show | jobs
by Fr0styMatt 4429 days ago
Worked on a .NET app with lets of C# and C++ interop and we were seeing what looked like a rare race condition. The bug was hard to repro and never seemed to happen in our dev environment (like all great bugs!). The native code would just trigger access violations and we couldn't figure out what triggered it. We ended up getting extremely lucky and getting to a point where the bug started happening on my dev machine while I was running the app under a specific profiling tool.

It turned out we were racing the .NET garbage collector when passing a reference off to some native code. The variable being used by the native code would be collected by .NET as it determined that we no longer needed the variable when we were still actually using it. A classic .NET interop mistake that was fixed by wrapping the use in a 'using' clause. The reason we didn't see it while developing was that .NET garbage collection is different between Debug and Release builds (Debug builds extend the lifetimes of all variables so that they live for the entire length of the function they are used in; Release builds optimize for the shortest variable lifetime).

Another time, had a crash that seemed to occur after a consistent time period. The crash was in third-party code that we didn't have the source for. It 'kind of' behaved like a memory leak, except that everything was telling us the app's memory usage was normal.

After several weeks almost tearing my hair out over it, I ended up setting a breakpoint in the third-party lib using WinDbg and using PageHeap on the app while it was working in order to see who owned the memory that the third-party library was trying to write to. That lead us to a line of code that was performing a malloc() without checking the result; then sending that off to the third party library for use later. The malloc failed, neither us nor the third-party library checked for the NULL and the third-party lib wouldn't try to use the memory until later.

We eventually determined that we were seeing heap fragmentation causing the malloc to fail; we found that under certain conditions we'd leak small objects. Again WinDbg and other tools came in handy; they DID report high address space fragmentation, but it took us a long time to put the two things together. WHen we fixed the small object leak, the crash in the third-party library went away.

So classic bugs in hindsight, but they certainly taught us how awesome WinDbg can be (along with the other excellent debugging facilities provided by Windows). In the end, well worth the time investment in learning.