Hacker News new | ask | show | jobs
by flutetornado 584 days ago
GPU workloads are either compute bound (floating point operations) or memory bound (bytes being transferred across memory hierarchy.)

Quantizing in general helps with the memory bottleneck but does not help in reducing computational costs, so it’s not as useful for improving performance of diffusion models, that’s what it’s saying.

1 comments

Exactly. The smaller bit widths from quantization might marginally decrease the compute required for each operation, but they do not reduce the overall volume of operations. So, the effect of quantization is generally more impactful on memory use than compute.
Except in this case they quantized both the parameters and the activations leading to decreased compute time too.