As someone with a lot of interest in but no fluency with chip design, or the dividing and conquering of math within silicon, for that matter, how would you multiply a 1m² matrix?
Each "unit of work" in matrix multiplication is not dependent on any other unit of work. Stuff as many cores as you can into a chip, and then simply feed in all your vectors at the same time.
A million element square matrix is a lot of data. To process that in a second is much more bandwidth than a single socket can support, so you'll need many sockets too.
Each "unit of work" in matrix multiplication is not dependent on any other unit of work. Stuff as many cores as you can into a chip, and then simply feed in all your vectors at the same time.
I.e. basically a beefed up GPU or an "AI" chip.