|
|
|
|
|
by Tunabrain
3424 days ago
|
|
Any entry level optimization course will make you implement something like that; do a matrix+matrix addition, but traverse it in row-major order in one program, and column-major in the other. If the matrix is big enough, you will easily get a 10x difference. For even larger sizes (once you thrash the TLB), you will get an additional factor. You might not get up to 60, but this is just a simple matrix addition. Even just matrix multiplication might be enough to get a 60x difference just with optimizing for memory hierarchy. I expect that high-performance memory bound programs are more complex than matrix multiplication, so the parent comment's claim doesn't seem too unlikely to me. |
|
The first time I encountered this myself, was doing texture mapping on the 486 back in the mid 90s. Texels would be laid out by row, and the mapper would draw horizontally. Texture maps that would fit within the L1 cache would draw fine no matter which rotation they were drawn it.
However texture maps that were larger than the L1-cache, would be drawn fine as long as the texture map wasn't significantly rotated. But, if you rotated the polygon by 90 degrees you'd see a significant drop in frames per second, because you'd skip a while row between drawing each texel, which would effectively cause a lot of cache misses. OK, I didn't manage to explain that well, but I hope it is understable.
Obviously, I have encountered this many times since, in many variants (including disk read/write), and I definitely agree that a factor of 10 should be possible.
I guess, I am just wondering if the difference between RAM and L1 has reached a factor 60, and how easy it is to come up with something where the cache misses all the time, or if caches have become smarter.