by gpderetta
536 days ago
I don't think there is a generally optimal design. There are pros and cons to using the same homogeneous FMA units for adds, multiplies and FMAs, even at the cost of making adds slower: the design is simpler, and having all instructions share the same latency greatly simplifies scheduling. IIRC Intel went from 4-cycle FMA, add and mul, to 4-cycle add and mul with 5-cycle FMA, and then to a dedicated 3-cycle add. The optimal design depends a lot on the rest of the microarchitecture, the workloads the core is being optimized for, the target frequency, the memory latency, etc.
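To make the latency trade-off concrete, here is a rough back-of-the-envelope sketch. It is a toy model, not a description of any real core: the function name `reduction_cycles` and its parameters are made up for illustration, and it ignores loads, decode width, and everything else in the pipeline. It only estimates how long a floating-point sum takes when it is limited either by the dependent-add chain or by how many adds can issue per cycle.

```python
import math

def reduction_cycles(n, add_latency, accumulators=1, adds_per_cycle=1):
    """Toy estimate of the cycles needed to sum n floats.

    Hypothetical model: each accumulator forms its own dependent chain,
    so the chain bound is (adds per accumulator) * add_latency, while the
    issue bound is n / adds_per_cycle. The slower bound wins.
    """
    chain_bound = math.ceil(n / accumulators) * add_latency
    issue_bound = math.ceil(n / adds_per_cycle)
    return max(chain_bound, issue_bound)

# A naive single-accumulator loop is purely latency-bound, so a
# 3-cycle add beats a 4-cycle add by the full ratio.
naive_4 = reduction_cycles(1000, add_latency=4)
naive_3 = reduction_cycles(1000, add_latency=3)

# With enough independent accumulators the loop becomes issue-bound,
# and in this model the add latency no longer matters.
unrolled_4 = reduction_cycles(1000, add_latency=4, accumulators=8, adds_per_cycle=2)
unrolled_3 = reduction_cycles(1000, add_latency=3, accumulators=8, adds_per_cycle=2)
```

In this model the shorter add latency only pays off for code with long dependency chains, which is one reason the right choice depends on the workloads the core is tuned for.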