|
I'm still learning my way around caches and memory timing, so if someone more knowledgeable spots a mistake, I'd appreciate the correction. That said, here's what I'd say:

The core of the issue is cache-friendly access patterns. This is a simplified explanation, but here goes: a CPU only has so much cache, and it needs to keep enough data there that it isn't left waiting too long for the next batch of data requested from RAM. To that end, when you access a memory location, the CPU fetches the whole cache line containing it (commonly 64 bytes) and often prefetches a bit beyond, on the bet that most of your work in the near future will fall within that region. The second implementation jumps by 1024 * sizeof(double) bytes at each access (8 KiB on my system), which is plenty far to land outside whatever was fetched for the previous access, forcing the processor to sit and wait for yet another memory location to be cached.

There's another, probably less significant, way in which this is a pathological access pattern: alignment. When the processor works through contiguous memory, as in the first implementation, it is free to fetch memory in nicely sized chunks that begin and end on convenient boundaries. That matters because memory today is nothing if not a tower of multiplexed access. Asking for a few extra bits over the edge of a row, wherever those borders happen to fall on your system (the width of the memory bus is probably a good guess), may not seem like much, but if it makes the memory controller access a row of RAM that it otherwise wouldn't, it must first write the data in the starting row, wait out the appropriate delay for that data to be saved, and then switch rows and repeat the process.
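
To make the two access patterns concrete, here is a minimal sketch in C. The names zero_row_major, zero_col_major, col_major_stride_bytes, all_zero and time_fill are my own stand-ins, not the functions from the question, and the 1024 x 1024 array of double is an assumption based on the stride mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define N 1024  /* assumed dimension, taken from the 1024-element stride above */

/* Row by row: consecutive writes are sizeof(double) bytes apart. */
void zero_row_major(double (*a)[N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = 0.0;
}

/* Column by column: consecutive writes are N * sizeof(double) bytes apart. */
void zero_col_major(double (*a)[N]) {
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = 0.0;
}

/* Distance in bytes between two consecutive writes of the column walk. */
size_t col_major_stride_bytes(void) {
    return N * sizeof(double);
}

/* Returns 1 if every element is zero, 0 otherwise. */
int all_zero(double (*a)[N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            if (a[i][j] != 0.0)
                return 0;
    return 1;
}

/* CPU time in seconds spent in a single call of fill(a). */
double time_fill(void (*fill)(double (*)[N]), double (*a)[N]) {
    assert(a != NULL);
    clock_t start = clock();
    fill(a);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To compare the two, call time_fill once with each function on the same array, after a warm-up pass so that first-touch page faults don't skew the numbers. Both walks write exactly the same elements; only the order differs, so any gap you measure comes from the access pattern. How large that gap is depends on your cache sizes, prefetcher and compiler flags, so treat your own measurements as the authority.
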
Looking at zeroarray2 in that context, you're asking the memory controller to save sizeof(double)*8 bits of zeroes, a fraction of a row, before moving on to another row, only to revisit the first one some time later.

Answers to the questions pertaining to memory (rather than concurrency) can all be found in Ulrich Drepper's excellent tour of contemporary memory systems, "What Every Programmer Should Know About Memory" ( http://www.akkadia.org/drepper/cpumemory.pdf ).

Edit: fixed accidental italics, added a useful additional reading link. |