| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by flutetornado 630 days ago
	GPU workloads are either compute bound (floating point operations) or memory bound (bytes being transferred across memory hierarchy.) Quantizing in general helps with the memory bottleneck but does not help in reducing computational costs, so it’s not as useful for improving performance of diffusion models, that’s what it’s saying.

1 comments

pkAbstract 630 days ago

Exactly. The smaller bit widths from quantization might marginally decrease the compute required for each operation, but they do not reduce the overall volume of operations. So, the effect of quantization is generally more impactful on memory use than compute.

link

superkuh 630 days ago

Except in this case they quantized both the parameters and the activations leading to decreased compute time too.

link