by ryao
557 days ago
Usually, you can do two AVX-512 operations per cycle, and with FMADD (fused multiply-add) instructions you can do two floating-point operations for the price of one. That would be 128 operations per cycle per core. The result would be 16 TFlops on a 2 GHz, 64-core CPU, not 4 TFlops. This would give a difference of one order of magnitude, rather than four. For inference, prompt processing is compute-intensive, while token generation is memory-bandwidth-bound. The differences in memory bandwidth between CPUs and GPUs tend to be more profound than the differences in compute.
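For anyone who wants to check the arithmetic, here is a rough sketch in Python. The function names and parameters are made up for illustration. Note one assumption: reaching 128 operations per cycle requires 32 lanes per 512-bit vector, which means 16-bit elements; the comment above does not state the element width. With 32-bit floats the same formula gives 64 per cycle and roughly 8 TFlops.

```python
def flops_per_cycle_per_core(vector_bits=512, element_bits=16,
                             vector_ops_per_cycle=2, flops_per_lane=2):
    """Theoretical peak floating-point operations per cycle for one core.

    element_bits=16 is an assumption (not stated in the comment above);
    flops_per_lane=2 counts a fused multiply-add as two operations.
    """
    lanes = vector_bits // element_bits
    return vector_ops_per_cycle * lanes * flops_per_lane


def peak_flops(cores, clock_hz, **kwargs):
    """Theoretical peak FLOPS for the whole CPU (ignores throttling, memory stalls)."""
    return cores * clock_hz * flops_per_cycle_per_core(**kwargs)


if __name__ == "__main__":
    # 64 cores at 2 GHz, 16-bit elements (assumed)
    print(peak_flops(64, 2_000_000_000) / 1e12)                    # 16.384
    # Same CPU with 32-bit floats
    print(peak_flops(64, 2_000_000_000, element_bits=32) / 1e12)   # 8.192
```

These are peak numbers only; as noted above, token generation tends to be limited by memory bandwidth long before the compute ceiling matters.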