by nkurz 2318 days ago
> I could write a program in assembly that is simply 1000000 triplets of (load, add, store) instructions, each reading from a sequentially-increasing memory location. We could think of it like a fully-unrolled addition loop. My CPU, operating at 3GHz, supposedly should complete this program in ~1ms (3 million instructions running at 3 billion instructions per second), but (spoiler alert) it doesn't. Why?

I'll bite. Why doesn't it? And how long do you expect it to take? I'll claim that with a modern processor a simple loop in C probably beats this speed. If you want, we can test afterward to see if our respective theories are true.

The linked article claims that the single-threaded read speed for sequential memory (on his machine) is 11 GB/s. On a 3 GHz system, that works out to a throughput of a little over 3 bytes per cycle (11/3 ≈ 3.7). Put another way, every 3 cycles we can pull down 11 bytes, which should be enough to comfortably finish our 1M loads of 32-bit integers in 1 ms. With 64-bit integers it's getting a little tighter, but it should still be possible.
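
To make that arithmetic concrete, here is a rough sketch in C. The helper names are made up for illustration, and it assumes "11 GB/s" means 11e9 bytes per second and counts only the read traffic, not the stores.

```c
#include <stdio.h>

/* Hypothetical helper: bytes deliverable per clock cycle. */
double bytes_per_cycle(double bandwidth_bytes_per_s, double clock_hz) {
    return bandwidth_bytes_per_s / clock_hz;
}

/* Hypothetical helper: milliseconds needed to stream n_elems elements
 * of elem_size bytes each at the given bandwidth (reads only). */
double stream_time_ms(double n_elems, double elem_size_bytes,
                      double bandwidth_bytes_per_s) {
    return n_elems * elem_size_bytes / bandwidth_bytes_per_s * 1e3;
}

/* Prints the estimates for the numbers quoted above. */
void print_estimates(void) {
    const double bw = 11e9;   /* 11 GB/s, figure from the linked article */
    const double clk = 3e9;   /* 3 GHz */
    printf("bytes/cycle: %.2f\n", bytes_per_cycle(bw, clk));
    printf("1M x 32-bit: %.2f ms\n", stream_time_ms(1e6, 4, bw));
    printf("1M x 64-bit: %.2f ms\n", stream_time_ms(1e6, 8, bw));
}
```

Calling print_estimates() from main should show both element sizes fitting inside the 1 ms budget on read bandwidth alone, with the 64-bit case leaving much less headroom.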

I guess on a technicality you might be right, but not in a good way. If you were to fully unroll the assembly, you might be able to slow things down enough that you were running at less than 1 integer per 3 cycles. A simple loop is going to be faster here. Done right (fused SUB/JNZ), the loop overhead should actually be zero. Depending on what you are doing with the store (in place, or to another sequential array?), I'd guess you'd be able to get down to less than 2 cycles per 32-bit int with simple C, and pretty close to 1 cycle per int if you went all out with software prefetching and huge pages.
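
For reference, the kind of "simple C" loop I have in mind is something like the sketch below. It assumes the in-place variant, and the function name and the constant being added are placeholders. It is not a benchmark: the cycle counts above are guesses until someone measures, and an optimizing compiler may well vectorize this rather than emit one scalar load/add/store per element.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: one load, one add, one store per element,
 * walking memory sequentially and writing back in place. */
void add_in_place(uint32_t *data, size_t n, uint32_t k) {
    for (size_t i = 0; i < n; i++) {
        data[i] += k;
    }
}
```

To test the claim you'd call this on a 1M-element array and time it against the fully unrolled assembly version.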