
On high-performance code for an embedded PPC system I used to work on, we made every control block a multiple of the L1 cache line size. Our allocation routines all used inline assembler to run the dcbz (data cache block zero) instruction on each cache block of the control block as it was allocated. This meant the control block was always zeroed, and the memory bus wasn't touched to do it. Yes, things were evicted from the cache, but since we were about to start writing into the control block anyway, skipping the fetch was a net gain.
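
For anyone who hasn't seen the trick, here is a rough sketch of what such an allocator might look like. It is not the original code. The names `cb_alloc` and `zero_block` are made up, the 32-byte `CACHE_BLOCK` is an assumption (the real size depends on the core), and it assumes the memory is cacheable and the block is cache-line aligned. On non-PPC targets it falls back to `memset` so the sketch still compiles, though that fallback has none of the bus-traffic benefit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed cache block size. 32 bytes is common on older embedded PPC
   cores, but check the manual for the actual CPU. */
#define CACHE_BLOCK 32u

/* Zero one cache block (hypothetical helper).
   On PowerPC, dcbz establishes an all-zero block in the data cache
   without first fetching the old contents from memory.
   On other targets, fall back to a plain memset. */
static inline void zero_block(void *p)
{
#if defined(__powerpc__) || defined(__PPC__)
    __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory");
#else
    memset(p, 0, CACHE_BLOCK);
#endif
}

/* Hypothetical control-block allocator: rounds the size up to a whole
   number of cache blocks, allocates cache-aligned memory, and zeroes
   it one cache block at a time. Returns NULL on failure or size 0. */
static void *cb_alloc(size_t size)
{
    if (size == 0)
        return NULL;

    /* Round up to a multiple of CACHE_BLOCK (a power of two). */
    size_t rounded = (size + CACHE_BLOCK - 1) & ~(size_t)(CACHE_BLOCK - 1);

    /* C11 aligned_alloc requires size to be a multiple of alignment,
       which the rounding above guarantees. */
    void *p = aligned_alloc(CACHE_BLOCK, rounded);
    if (p == NULL)
        return NULL;

    /* dcbz operates on whole, aligned cache blocks. */
    assert(((uintptr_t)p % CACHE_BLOCK) == 0);

    for (size_t off = 0; off < rounded; off += CACHE_BLOCK)
        zero_block((char *)p + off);

    return p;
}
```

Rounding the size up matters because dcbz always clears a whole cache block. If a control block shared its last cache line with a neighbouring object, zeroing it would wipe part of the neighbour too.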