BTW, it’s the same for GPUs. In DXBC shader bytecode, the mad instruction does FMA. When reporting theoretical FLOPs, GPU vendors count each FMA as 2 float operations.
For example, I have a GeForce 4070 Ti Super in my desktop. The chip has 8448 execution units. nVidia calls them CUDA cores but I don’t like the name; the correct number is 66 cores, where each core can do 4 wavefronts of 32 threads each (66 × 4 × 32 = 8448).
Anyway, these EUs can do one FP32 FMA each cycle, and the boost clock frequency is 2.61 GHz.
Multiplying these two numbers results in 22.04928E+12 cycles*EU/second. Counting each FMA as 2 operations doubles that to about 44.1E+12, and nVidia reports 44E+12 FLOPs peak FP32 performance of the GPU.
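
If you want to redo the arithmetic yourself, here is a rough Python sketch. The function and parameter names are made up for illustration, the hardware numbers are the ones quoted above, and the model assumes every EU retires one FMA per cycle at boost clock, which real workloads rarely sustain.

```python
# Back-of-the-envelope peak FP32 throughput, using the figures from the comment.
# All names here are hypothetical; this is not any vendor's official formula.

def execution_units(cores: int, wavefronts_per_core: int, threads_per_wavefront: int) -> int:
    """Total FP32 lanes (what nVidia markets as CUDA cores)."""
    return cores * wavefronts_per_core * threads_per_wavefront


def peak_fp32_flops(eus: int, clock_hz: float, flops_per_fma: int = 2) -> float:
    """Theoretical peak, assuming one FMA per EU per cycle.

    An FMA is counted as 2 float operations (one multiply, one add),
    which is how GPU vendors report the number.
    """
    return eus * clock_hz * flops_per_fma


eus = execution_units(cores=66, wavefronts_per_core=4, threads_per_wavefront=32)
fma_per_second = eus * 2.61e9          # cycles*EU/second at boost clock
peak = peak_fp32_flops(eus, 2.61e9)    # same thing with FMA counted twice

print(f"EUs: {eus}")
print(f"FMA/s: {fma_per_second:.5e}")
print(f"Peak FP32: {peak / 1e12:.2f} TFLOPs")
```

Passing `flops_per_fma=1` gives the count of FMA instructions per second instead, which is the number to compare against if your code is not dominated by fused multiply-adds.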