Hacker News
by ThrownAllTheWay 1381 days ago
Also, investing in a branch predictor when the intended workload doesn't seem at all scalar is a confusing choice to me. The 362 FP16 TFLOPS sounds super impressive, except the memory bandwidth is, I think, 800 GB/s (or is it 5 times that? Or effectively less than that if data has to be passed along multiple hops? I'm a bit confused), which means having to do roughly 1000 ops (or 200? or more?) on each 16-bit value loaded in just to keep the compute units busy. Maybe you could do that, but it feels like you'd end up bandwidth bound most of the time.
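For anyone who wants to check the back-of-envelope numbers, here is a minimal sketch of the arithmetic. It only uses the figures quoted in the comment (362 TFLOPS, 800 GB/s, and the "5 times that" guess); the function name and the assumption that every operand is a 2-byte value loaded exactly once are mine, not from any spec.

```python
def ops_per_value(peak_flops, bandwidth_bytes_per_s, bytes_per_value=2):
    """FLOPs needed per loaded value to stay compute bound (simple roofline ratio)."""
    return peak_flops / bandwidth_bytes_per_s * bytes_per_value

PEAK_FP16 = 362e12        # 362 TFLOPS, as quoted in the comment
BW_LOW = 800e9            # 800 GB/s, the commenter's first guess
BW_HIGH = 5 * BW_LOW      # "or is it 5 times that?" (assumed 4 TB/s)

low = ops_per_value(PEAK_FP16, BW_LOW)
high = ops_per_value(PEAK_FP16, BW_HIGH)

print(f"{low:.0f} ops per 16-bit value at 800 GB/s")
print(f"{high:.0f} ops per 16-bit value at 4 TB/s")
```

That gives about 905 and 181, which is where the rough "1000" and "200" come from. Multi-hop transfers would lower the effective bandwidth and push the required ops per value higher.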
1 comment

My understanding is that they load weights into SRAM only occasionally, then pump training data in at the sides of the die and have multiple cores operate on a wavefront of data. So the cores don't compete for host memory bandwidth, because the same data flows (transformed) through multiple cores.
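To make the reuse argument concrete, here is a toy model, not a description of the actual chip. It assumes each "core" keeps one small weight matrix in local memory, each input vector is fetched from off-chip once, and activations are handed from core to core. The names `run_pipeline`, `stage_weights`, and the sizes are all made up for illustration, and the occasional weight loads are ignored.

```python
def run_pipeline(inputs, stage_weights):
    """Pass each input through a chain of cores, counting FLOPs and off-chip loads."""
    offchip_values = 0
    flops = 0
    outputs = []
    for x in inputs:
        offchip_values += len(x)      # each input value is fetched from off-chip once
        for w in stage_weights:       # activations move core to core, not back to host memory
            x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
            flops += 2 * len(w) * len(w[0])   # one multiply and one add per weight
        outputs.append(x)
    return outputs, flops, offchip_values

N, STAGES = 4, 8                      # hypothetical vector width and number of cores
identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
stage_weights = [identity] * STAGES   # identity weights keep the example easy to check
inputs = [[1.0, 2.0, 3.0, 4.0]] * 3

outputs, flops, offchip_values = run_pipeline(inputs, stage_weights)
intensity = flops / offchip_values
print(f"{flops} FLOPs for {offchip_values} off-chip values: {intensity:.0f} ops per value")
```

In this model the ops per loaded value is 2 * N * STAGES, so it grows with both the layer width and the number of cores the data passes through. That is the sense in which a long enough wavefront could reach hundreds of ops per value without extra host bandwidth.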