That definitely changes the calculus, but as I've mentioned in a different comment there doesn't seem to be literally any microarchitectural documentation to read, so I (don't own an M1) have nothing to go off unfortunately.
I'll make a wild guess that getting data to the neural engine is still probably not quick because I assume it's some kind of statically scheduled type affair (exposed pipeline?). We literally seem to know almost nothing about it sadly.
I'll make a wild guess that getting data to the neural engine is still probably not quick because I assume it's some kind of statically scheduled type affair (exposed pipeline?). We literally seem to know almost nothing about it sadly.
https://jobs.apple.com/en-gb/details/200205070/neural-engine...
Even the job listing gives away next to nothing, other than poor English "Knowledge in compiler is a plus" ;)