|
|
|
|
|
by ThrownAllTheWay
1381 days ago
|
|
Also investing in a branch predictor when the intended workload doesn't seem at all scalar is a confusing choice to me. Also the 362 F16 TFLOPs sounds super impressive, except the memory bandwidth is I think 800 GB/s (or is it 5 times that? Or effectively less than that if data has to be passed along multiple hops? I'm a bit confused), which means having to do 1000 ops (or 200? or more?) on each 16 bit value loaded in. Maybe you could do that, but it feels like you'd probably end up bandwidth bound most of the time. |
|