by whiatp 486 days ago
This made me think of a series of "war room" meetings I had been part of early in my career. Strangely enough, also a defect revealed when the platform was low on memory. This was also the issue where I learned the value of documenting experiments and results once an investigation has taken a non-trivial amount of time. Not just to show management what you are doing, but to keep track of all the things you have already tried rather than spinning in circles.

The war room meetings were full of managers and QA engineers reporting on how many times they had reproduced the bug. Their repro relied on triggering a super slow memory leak in the main user UI. I had the utmost respect for the senior QA engineer who actually listened to us when we said we could repro the issue way faster, and didn't need the twice-daily reports on manual repro attempts. He took the meetings from his desk, 20 feet away, visible through the glass wall of the room we were all crammed into. I unfortunately didn't have the seniority to do the same.

Since I can't resist telling a good bug story:

The symptom we were seeing was that when the system was low on memory, a process (usually the main user UI, but not always) would get either a SIGILL at a memory location containing a valid CPU instruction, or a floating point divide by zero exception at a code location that didn't contain a floating point instruction. I built a memory pressure tool that would frequently read how much memory was free and would mmap (and dirty) or munmap pages as necessary to hold the system just short of the OOM kill threshold. I could repro what the slow memory leak was doing to the system in seconds, rather than waiting an hour for the memory leak to do it.
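
For anyone who wants to try something similar, here is a rough sketch of that kind of pressure tool. It is not the original code: the function names, the use of MemAvailable from /proc/meminfo, the chunk size and the polling interval are all my assumptions, and the target has to be picked by hand for the system under test.

```python
import mmap
import time

PAGE = mmap.PAGESIZE


def mem_available_kb(meminfo_text):
    """Return the MemAvailable value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def next_action(available_kb, target_kb, chunk_kb):
    """Decide what to do this tick: 'grow', 'shrink', or 'hold'."""
    if available_kb > target_kb + chunk_kb:
        return "grow"      # room to take another chunk without overshooting
    if available_kb < target_kb:
        return "shrink"    # too close to the edge, give a chunk back
    return "hold"


def dirty(region):
    """Write one byte per page so the kernel has to back it with real memory."""
    for offset in range(0, len(region), PAGE):
        region[offset] = 1


def hold_pressure(target_kb, chunk_kb=4096, interval_s=0.05, max_ticks=None):
    """Keep available memory near target_kb by mapping or unmapping chunks.

    target_kb is a hypothetical, hand-tuned value chosen to sit just above
    the point where the OOM killer would step in.
    """
    held = []
    tick = 0
    while max_ticks is None or tick < max_ticks:
        with open("/proc/meminfo") as f:
            available = mem_available_kb(f.read())
        action = next_action(available, target_kb, chunk_kb)
        if action == "grow":
            region = mmap.mmap(-1, chunk_kb * 1024)  # anonymous mapping
            dirty(region)
            held.append(region)
        elif action == "shrink" and held:
            held.pop().close()                       # unmaps the chunk
        time.sleep(interval_s)
        tick += 1
    for region in held:
        region.close()
```

Something like hold_pressure(target_kb=20000) would then squeeze the system until only about 20 MB is available and keep it there. Run it only on a machine you are happy to see thrash.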

I wanted to learn more about what was going on between code being loaded into memory and then being executed, which led me to look into the page fault path. I added some tracing that would dump out info about recent page faults after a SIGILL was sent out. It turned out that all of the code hitting these mysterious errors had always been _very_ recently loaded into memory. I realized that when Linux is low on memory, one of the ways it can get some memory back is to throw out unmodified memory-mapped file pages, like the executable pages of libraries and binaries. In the extreme case, the system makes almost no forward progress and spends almost all of its time loading code, briefly executing it, and then throwing it out for another process's code.

I realized there was a useful-looking code path in the page fault logic that we never seemed to hit. This code path would check if the page was marked as having been modified (and if I recall correctly, also if it was mapped as executable). If it passed the check, this code would instruct the CPU to flush the data cache in the address range back to the shared L2 cache, and then clear the instruction cache for the range. (The ARM processor we were using didn't have any synchronization between the L1 instruction and L1 data cache, so writing out executable content requires extra synchronization, both for the kernel loading code off disk, as well as JIT compilers.) With a little more digging around, I found the kernel's implementation of scatter-gather copy would set that bit. However, our SoC vendor, in their infinite wisdom, made a copy of that function that was exactly the same, except that it didn't set the bit in the page table. Of course they used it in their SDIO driver.
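
To make the failure mode concrete, here is a toy model of it. It is only an illustration under my own assumptions, not kernel code: ToyCore, fault_in and the marks_dirty flag are made-up names, the "instructions" are just strings, and it assumes the driver's copy was done with ordinary CPU stores that land in the data cache.

```python
class ToyCore:
    """Toy core with separate L1 instruction and data caches that are not coherent."""

    def __init__(self):
        self.memory = {}   # shared L2 / RAM: address -> instruction
        self.dcache = {}   # L1 data cache: dirty lines not yet written back
        self.icache = {}   # L1 instruction cache: lines fetched earlier

    def store(self, addr, value):
        # CPU stores land in the data cache first.
        self.dcache[addr] = value

    def fetch(self, addr):
        # Instruction fetch sees the I-cache, then memory, never the D-cache.
        if addr not in self.icache:
            self.icache[addr] = self.memory.get(addr, "garbage")
        return self.icache[addr]

    def sync_range(self, addrs):
        # Write dirty data back to shared memory, then drop stale I-cache lines.
        for addr in addrs:
            if addr in self.dcache:
                self.memory[addr] = self.dcache.pop(addr)
            self.icache.pop(addr, None)


def fault_in(core, addrs, code, marks_dirty):
    """Copy code into a page, then run the page fault path's dirty check."""
    for addr, insn in zip(addrs, code):
        core.store(addr, insn)
    if marks_dirty:            # the stock copy routine sets the bit
        core.sync_range(addrs)
    # The vendor's copy never sets it, so the sync above is skipped.
    return [core.fetch(addr) for addr in addrs]


good = ToyCore()
good.memory = {0: "old0", 1: "old1"}
print(fault_in(good, [0, 1], ["add", "ret"], marks_dirty=True))

bad = ToyCore()
bad.memory = {0: "old0", 1: "old1"}
print(fault_in(bad, [0, 1], ["add", "ret"], marks_dirty=False))
```

With the flag set, the fetches return the freshly loaded instructions. Without it, the core fetches whatever the page frame used to hold, even though the correct bytes are sitting in the data cache, which would fit a SIGILL at an address that looks perfectly valid once you inspect it afterwards.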