| A bit baffled by this because on every axis I look this seems like a dream of a compilation target. * No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle. * Model parallel alone is full performance, no need for data parallel if you size to fit. * Defects are handled in hardware; any latency differences are hidden & not in load path anyway. * Fully asynchronous/dataflow by default, only need minimal synchronization between forward/backward passes. I genuinely don't know how you'd build a simpler system than this. |
In particular, when the compiler's job changes from optimally scheduling a single state machine to placing operations on a fixed routing grid (à la FPGA), the problem becomes radically different, and any looping control flow becomes an absolute nail-biter of an issue.
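To make the placement point concrete, here's a toy sketch (my own illustration, not how any real toolchain does it). It assumes ops sit on a 2D grid and that routing cost is roughly the Manhattan distance between producer and consumer. The names `wire_lengths`, `row`, and `ring` are made up for the example.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]
Edge = Tuple[str, str]


def wire_lengths(placement: Dict[str, Coord], edges: List[Edge]) -> List[int]:
    """Manhattan distance per edge: a crude stand-in for routing hops (assumption)."""
    return [
        abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
        for a, b in edges
    ]


ops = ["a", "b", "c", "d", "e", "f"]
chain: List[Edge] = list(zip(ops, ops[1:]))  # feed-forward: a -> b -> ... -> f
loop: List[Edge] = chain + [("f", "a")]      # same graph plus a loop back-edge

# Naive placement: everything in one row, in program order.
row = {op: (i, 0) for i, op in enumerate(ops)}

# Loop-aware placement: fold the chain so the last op lands next to the first.
ring = {"a": (0, 0), "b": (1, 0), "c": (2, 0),
        "d": (2, 1), "e": (1, 1), "f": (0, 1)}

chain_row = wire_lengths(row, chain)  # every hop is a neighbor
loop_row = wire_lengths(row, loop)    # back-edge spans the whole row
loop_ring = wire_lengths(ring, loop)  # back-edge is a neighbor again

print("chain, row placement:", chain_row)
print("loop,  row placement:", loop_row)
print("loop,  ring placement:", loop_ring)
```

With no back-edge, the obvious in-order layout is fine. Add the loop and the same layout suddenly has one wire crossing the whole row, so the placement has to be rethought around the cycle. A real grid also has limited routing capacity and many interacting loops, which this sketch ignores entirely.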