| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by hansvm 1414 days ago

This sort of thinking can unlock up to roughly a factor of two in a lot of architectures and a lot of operations (depending on cache friendliness, instruction decoding, and other such things that might interfere with apparent wins).

Classic example: Intel Haswell chips can only do addition on one of their execution units, so you double your throughput by doing a fused multiply-add on the other with a multiplicand of 1.

Classic example: For large enough matrix multiplications you can load your data in such a way that the problem really is CPU-bound, even if you have to load from disk. Double up your data size with an integer representation of the matrix and do some integer-backed emulated floating point instructions for a chunk of the matrix and floating point for the other. You reduce the factor in the O(n^(2.7..3)) and add some extra O(n^2) work and some extra space (assuming you don't waste too many instructions on full IEEE compliance and don't need it for that application).

And so on. It's a fun trick, and often the compiler isn't able to execute it effectively. The same idea applies with all but the most trivial loops; until recently (last 10 years or so?), gcc wouldn't interleave cumulative sums to reduce the data dependencies (just keep 4 running totals instead of 1, skip by 4 each loop, and merge the sums and any stragglers at the end -- distinct from loop unrolling in that the major gain is from better pipelining rather than just reduced loop overhead -- it still works with vectorized accumulators). Not that it matters if you're IO-bound (which you often are on that sort of problem), but for mediumish datasets the idea still applies.