Hacker News new | ask | show | jobs
by drv 3218 days ago
The assembly version is using the packed version of the FMA instruction (that's what the "P" in the mnemonic stands for), but as far as I can tell, it's only using one of the packed values, whereas the instruction can calculate two (AVX) or four (AVX2) FMA operations at once. It might be possible to get some speedup by rearranging the calculation so it can use the full width of the vector registers - at first glance, at least the two sides of the division should be possible to calculate in parallel with half as many FMA instructions.