On-die communication isn't free. A lot of the work here is sequential, and within a matrix multiply the cores have to transfer their outputs and the memory loads have to be distributed. It's really fast, but not one-cycle fast.
You could add a series of latches, use the magic of graph coloring to eliminate any timing issues, and pipeline the thing sufficiently to get a GHz of throughput, even if it takes many cycles to make it all the way through the pipe.
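To make the latency-versus-throughput point concrete, here is a rough Python sketch of a latch-separated pipeline. It is only a toy model: `run_pipeline` and the three example stages are made-up names for illustration, and it says nothing about graph coloring or real timing closure. It just shows that once the pipe is full, one result comes out per clock even though each input takes several clocks to get through.

```python
def run_pipeline(inputs, stages):
    """Clock a latch-separated pipeline; return (cycle, result) pairs.

    Hypothetical toy model: latches[i] holds the output of stage i,
    and None marks an empty slot (a bubble).
    """
    latches = [None] * len(stages)
    feed = list(inputs)
    results = []
    cycle = 0
    while feed or any(v is not None for v in latches):
        cycle += 1
        # Update from the last stage backwards so each value advances
        # exactly one stage per clock.
        for i in range(len(stages) - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = stages[i](prev) if prev is not None else None
        latches[0] = stages[0](feed.pop(0)) if feed else None
        if latches[-1] is not None:
            results.append((cycle, latches[-1]))
    return results


# Three made-up stages: the first result appears on cycle 3 (the latency),
# then one result appears on every following cycle (the throughput).
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
out = run_pipeline([1, 2, 3, 4], stages)
```

With these stages, `out` holds results on consecutive cycles 3, 4, 5 and 6, so the depth of the pipe costs latency but not throughput.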
Personally, I'd put all the parameters in NOR flash, then cycle through the row lines sequentially to load the parameters into the MAC. You could load all the inputs in parallel as fast as the dynamic power limits of the chip allow. If you use either DMA or a hardware ring buffer to push all the tokens through the layers, you could keep the throughput going with various sizes of models, etc.
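Roughly what I mean, as a Python sketch rather than a hardware design: the nested lists stand in for rows of NOR flash, `mac_row` stands in for the MAC, and a `deque` stands in for the ring buffer. All of these names are hypothetical, and nonlinearities, quantization and the DMA option are left out entirely.

```python
from collections import deque


def mac_row(weights_row, inputs):
    """Multiply-accumulate one row of parameters against all inputs."""
    acc = 0
    for w, x in zip(weights_row, inputs):
        acc += w * x
    return acc


def layer_forward(flash_rows, inputs):
    """Step through the row lines one at a time; the inputs are all
    presented in parallel to each row."""
    return [mac_row(row, inputs) for row in flash_rows]


def run_ring(layers, tokens):
    """Push every token in the ring buffer through every layer in order."""
    ring = deque(tokens)
    outputs = []
    while ring:
        activations = ring.popleft()
        for flash_rows in layers:
            activations = layer_forward(flash_rows, activations)
        outputs.append(activations)
    return outputs


# Two tiny made-up layers: a 2x2 weight matrix followed by the identity.
layers = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
result = run_ring(layers, [[1, 1], [2, 0]])
```

Because the layer list is just data, the same loop handles models of different depth and width, which is the point of driving it from a buffer instead of hard-wiring the sequence.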
Obviously with only one MAC you couldn't have a single stream at a GHz, but you could have 4000 separate streams of 250,000 tokens/second.
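The arithmetic behind those numbers is just the clock rate divided across the streams. Here is a small Python sketch, assuming a 1 GHz clock and plain round-robin interleaving, which is my assumption about how the slots would be shared. The function names are made up for illustration.

```python
def per_stream_rate(clock_hz, num_streams):
    """Slots per second each stream gets when one unit is shared evenly."""
    return clock_hz / num_streams


def round_robin_schedule(num_streams, num_cycles):
    """Which stream owns the shared unit on each clock cycle."""
    return [cycle % num_streams for cycle in range(num_cycles)]


rate = per_stream_rate(1_000_000_000, 4000)   # 250000.0 slots per second
schedule = round_robin_schedule(4, 8)         # [0, 1, 2, 3, 0, 1, 2, 3]
```

Each stream only sees the unit once every 4000 cycles, so the aggregate stays at a GHz while no individual stream does.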