by liamkf
5340 days ago
Hmm, that seems a bit misleading. The SSE implementation will of course be much faster than a non-vectorized implementation... it's working on 4 floats at a time. A vectorized version of the marvelous method would be a fairer comparison. If your data is not very vectorizable and you don't need high precision, it's still fairly marvelous.
|
The SSE version is 16x faster than the x87 FPU version and almost 4x faster than the marvelous method. Further, the marvelous method is a loop and is much more resource-hungry than an instruction that goes away for 5-10 cycles (reciprocal throughput is now 1 cycle) but leaves you with most of your execution resources to do other useful work.
Vectorizing the marvelous method is very likely to be painful, as it treats a value as both a float and an int. While the xmm registers are dual-purpose, operating on one successively as an int and as a float causes some extra latency. The bit shifts and float operations require these cross-domain transfers...
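For reference, here is a rough sketch of what a 4-wide version might look like with SSE2 intrinsics. It assumes the "marvelous method" means the well-known bit-trick inverse square root with the magic constant 0x5f3759df and one Newton-Raphson step. The function names `rsqrt_magic_ps` and `rsqrt_magic4` are made up for illustration. The comments mark where the value crosses from the float domain to the integer domain and back, which is the transfer described above. The casts themselves emit no instructions; any extra latency would come from feeding the result of an integer op into a float op, and how much that costs depends on the microarchitecture.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE float ops too) */

/* Hypothetical helper: approximate 1/sqrt(x) for four positive floats at once. */
static inline __m128 rsqrt_magic_ps(__m128 x)
{
    const __m128 half         = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);

    /* Float -> integer domain: reinterpret the bits, then shift and subtract. */
    __m128i i = _mm_castps_si128(x);
    i = _mm_sub_epi32(_mm_set1_epi32(0x5f3759df), _mm_srli_epi32(i, 1));

    /* Integer -> float domain: the initial guess is consumed by float ops. */
    __m128 y = _mm_castsi128_ps(i);

    /* One Newton-Raphson step: y = y * (1.5 - 0.5 * x * y * y). */
    __m128 xhalf = _mm_mul_ps(x, half);
    y = _mm_mul_ps(y, _mm_sub_ps(three_halves,
                                 _mm_mul_ps(xhalf, _mm_mul_ps(y, y))));
    return y;
}

/* Hypothetical convenience wrapper for plain (possibly unaligned) arrays. */
static inline void rsqrt_magic4(const float in[4], float out[4])
{
    _mm_storeu_ps(out, rsqrt_magic_ps(_mm_loadu_ps(in)));
}
```

Swapping the body for a single `_mm_rsqrt_ps(x)` gives the hardware approximation to time against, so both sides of the comparison work on 4 floats at a time.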