by vlejd
204 days ago
I think the lack of efficient GPU kernels was the main problem. It is much, much easier to get a real speedup and memory reduction by quantizing from fp16 to fp8 than from 50% sparsity. For sparsity, you needed structure (which makes your model worse) and special hardware support.
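To put rough numbers on the memory side of this, here is a back-of-the-envelope sketch. It assumes a 2:4 structured pattern (keep 2 of every 4 weights) and 2 bits of index metadata per kept value; both are assumptions for illustration, and the helper names `prune_2_of_4` and `storage_bits` are made up here, not taken from any library.

```python
def prune_2_of_4(weights):
    """Keep the 2 largest-magnitude values in each group of 4, zero the rest."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def storage_bits(n_weights, bits_per_value, kept_fraction=1.0, index_bits_per_kept=0):
    """Bits needed to store the kept values plus any per-value index metadata."""
    kept = int(n_weights * kept_fraction)
    return kept * (bits_per_value + index_bits_per_kept)


n = 1024
dense_fp16 = storage_bits(n, 16)
dense_fp8 = storage_bits(n, 8)
# Assumed layout: half the values survive, each with a 2-bit position index.
sparse_fp16 = storage_bits(n, 16, kept_fraction=0.5, index_bits_per_kept=2)

print(f"fp8 vs fp16:          {dense_fp8 / dense_fp16:.4f}")
print(f"2:4 sparse fp16 vs fp16: {sparse_fp16 / dense_fp16:.4f}")
```

Under those assumptions, fp8 halves the storage outright, while 50% structured sparsity lands a bit above half because of the index metadata, and it still forces a fixed pattern onto the weights.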