	by vlejd 204 days ago
	I think the lack of efficient GPU kernels was the main problem. It is much, much easier to get a real speedup and memory reduction by quantizing from fp16 to fp8 than from 50% sparsity. To get a speedup from sparsity you needed structured sparsity (which makes your model worse) and special hardware support.
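
	To put rough numbers on the memory side of that argument, here is a back-of-the-envelope sketch. The storage layouts are my assumptions, not something the comment specifies: unstructured sparsity stored as CSR with 32-bit indices, and structured sparsity stored as a 2:4 layout with 2 bits of position metadata per kept value. The function names are made up for illustration, and this only counts bytes; it says nothing about kernel speed.

```python
# Rough weight-storage estimates for one 4096 x 4096 matrix.
# All layouts below are illustrative assumptions, not measurements.

def dense_bytes(rows, cols, bits_per_value):
    """Dense storage: every element is stored."""
    return rows * cols * bits_per_value // 8

def csr_bytes(rows, cols, density, bits_per_value=16, index_bits=32):
    """Unstructured sparsity in CSR: values + column indices + row pointers."""
    nnz = int(rows * cols * density)
    values = nnz * bits_per_value // 8
    col_indices = nnz * index_bits // 8
    row_ptrs = (rows + 1) * index_bits // 8
    return values + col_indices + row_ptrs

def two_four_bytes(rows, cols, bits_per_value=16, meta_bits=2):
    """Structured 2:4 sparsity: keep 2 of every 4 values,
    plus (assumed) 2 bits of position metadata per kept value."""
    kept = rows * cols // 2
    return kept * bits_per_value // 8 + kept * meta_bits // 8

rows = cols = 4096
fp16 = dense_bytes(rows, cols, 16)
layouts = {
    "dense fp16": fp16,
    "dense fp8": dense_bytes(rows, cols, 8),
    "50% unstructured, CSR fp16": csr_bytes(rows, cols, 0.5),
    "50% structured 2:4, fp16": two_four_bytes(rows, cols),
}
for name, size in layouts.items():
    print(f"{name:30s} {size / 2**20:7.2f} MiB  ({size / fp16:.2f}x of fp16)")
```

	Under these assumptions fp8 halves the footprint outright, CSR at 50% density ends up larger than the dense fp16 matrix because of the index overhead, and only the structured layout gets close to a 2x reduction.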