Hacker News new | ask | show | jobs
by Tunabrain 3424 days ago
Any entry level optimization course will make you implement something like that; do a matrix+matrix addition, but traverse it in row-major order in one program, and column-major in the other. If the matrix is big enough, you will easily get a 10x difference. For even larger sizes (once you thrash the TLB), you will get an additional factor.

You might not get up to 60, but this is just a simple matrix addition. Even just matrix multiplication might be enough to get a 60x difference just with optimizing for memory hierarchy. I expect that high-performance memory bound programs are more complex than matrix multiplication, so the parent comment's claim doesn't seem too unlikely to me.

1 comments

Sorry, I didn't mean to question it whether it is possible. I just think that a factor of 60 sounds tricky to achieve if we are just talking about RAM (no disk involved).

The first time I encountered this myself, was doing texture mapping on the 486 back in the mid 90s. Texels would be laid out by row, and the mapper would draw horizontally. Texture maps that would fit within the L1 cache would draw fine no matter which rotation they were drawn it.

However texture maps that were larger than the L1-cache, would be drawn fine as long as the texture map wasn't significantly rotated. But, if you rotated the polygon by 90 degrees you'd see a significant drop in frames per second, because you'd skip a while row between drawing each texel, which would effectively cause a lot of cache misses. OK, I didn't manage to explain that well, but I hope it is understable.

Obviously, I have encountered this many times since, in many variants (including disk read/write), and I definitely agree that a factor of 10 should be possible.

I guess, I am just wondering if the difference between RAM and L1 has reached a factor 60, and how easy it is to come up with something where the cache misses all the time, or if caches have become smarter.

Googling puts RAM latency avoiding caches on modern processors at 40-60 cycles.
Exactly... Now, that alone could make it tricky to get a factor 60 difference.
Write a simplistic O(N^3) algortihm

When N is 100, all fixed costs in the inner loop are multiplied by 1,000,000. So a nanosecond difference is now a millisecond. It's pretty easy to get worse than that by boneheaded data layout (row-major vs column-major is pretty common)

No...

If you use a "bonehead" data structure each memory reference will force a cache-miss, i.e. taking q CPU cycles instead of 1 CPU cycles.

If the algorithms has a running time of O(N^3), it means it is using some number of instructions <= c * N^3 to complete. Let's for simplicity sake say 1 operation is identical to 1 CPU-cycle.

Now, if each operation takes q CPU cycles instead of 1 cycles due to cache-miss on every operation, the running time of your algorithm is still bounded by q * (c * N^3) operations.

For the example with N=100 -> The fast version takes c * 1,000,000 operations to complete, for some value c. For the slow version, as each operation is now q times slower, it will take q * c * 1,000,000 operations to complete for values q and c.

I.e. it takes q times as long, not q * q * q.

What you were saying is that, if you run a O(N) algorithm on a 100 MHZ CPU, it will take 10 times as long as on a 1000 MHZ CPU, but an O(N^3) algorithm would take 1000 times as long. And that is obviously, not true. Both will take 10 times longer.

No

I'm saying that a poor choice of data structure will have an amplified impact on an O(N^3) algorithm.

Giving it as an example of where you could have the different implementations of the same algorithm have massively different runtimes on the same processor.