


	by Archit3ch 139 days ago
	I'm working with small matrices (e.g. 10x10 to 100x100), where I believe the effects of caches, pipelines, registers, etc. will kick in before the O(N^2)-vs-O(N^3) discussion matters. In that case, dispatching to the hardware accelerators (SME2 FMLA or AMX FMA) and doing a _dense_ solve with 512-bit vectors could still be faster than a sparse solve, or than NEON, at small matrix sizes. Though as mentioned elsewhere in the thread, these accelerators only offer throughput, and latency suffers...
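
	To make the trade-off concrete, here is a rough back-of-the-envelope cost model, not a benchmark. It assumes the sparse system is banded (a stand-in for "sparse"), uses the textbook leading-order flop counts for dense and banded LU, and treats each path as a fixed dispatch latency plus flops divided by a sustained rate. The function names and all rates and latencies are made-up placeholders for illustration, not measurements of SME2, AMX, or NEON.

```python
def dense_lu_flops(n):
    # Leading-order flop count for LU factorization of a dense n x n matrix.
    return (2.0 / 3.0) * n ** 3


def banded_lu_flops(n, bandwidth):
    # Leading-order flop count for LU without pivoting on a banded matrix
    # whose lower and upper bandwidths both equal `bandwidth`.
    return 2.0 * n * bandwidth ** 2


def estimated_seconds(flops, flops_per_second, fixed_latency=0.0):
    # Simple model: fixed dispatch latency plus throughput-bound compute time.
    return fixed_latency + flops / flops_per_second


def dense_wins(n, bandwidth, dense_rate, sparse_rate,
               dense_latency=0.0, sparse_latency=0.0):
    # True when the modeled dense solve finishes before the modeled banded one.
    dense_t = estimated_seconds(dense_lu_flops(n), dense_rate, dense_latency)
    sparse_t = estimated_seconds(banded_lu_flops(n, bandwidth),
                                 sparse_rate, sparse_latency)
    return dense_t < sparse_t


if __name__ == "__main__":
    # Placeholder rates (assumptions): a wide-vector accelerator path that is
    # 100x faster per flop than a scalar/sparse path.
    accel_rate = 1e12   # flop/s, hypothetical
    sparse_rate = 1e10  # flop/s, hypothetical

    # Throughput only: dense does ~33x more flops here but still finishes first.
    print(dense_wins(100, 10, accel_rate, sparse_rate))

    # Add a hypothetical 5 microsecond dispatch latency to the accelerator path
    # and the banded solve comes out ahead instead.
    print(dense_wins(100, 10, accel_rate, sparse_rate, dense_latency=5e-6))
```

	With these placeholder numbers, the first call favors the dense solve and the second favors the banded one, which is the throughput-versus-latency tension described above. Real crossover points depend on the actual sparsity pattern, pivoting, and measured hardware behavior.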