How much slower (per unit area) is it to do that in software than with a full 128-bit hardware unit?
See:
https://developer.nvidia.com/blog/cuda-11-6-toolkit-new-rele...
https://developer.nvidia.com/blog/implementing-high-precisio...