by SekstiNi
1210 days ago
> Interesting, though apparently the OPT175B model is 350GB:

Only in FP16. In the paper they use int4 quantization to reduce it to a quarter of that. In addition to the model weights, there's also a KV cache that takes up considerable amounts of memory, and they use int4 on that as well.

> I wonder what FlexGen is doing.. a naive guess is a mix of SSD and system memory.

That's correct, but other approaches have done this as well. What's "new" here seems to be the optimized data access pattern in combination with some other interesting techniques (prefetching, int4 quantization, CPU offload).
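For anyone who wants to check the size arithmetic, here is a rough back-of-the-envelope sketch. It assumes 175 billion parameters and decimal gigabytes, and it ignores the KV cache and any per-group quantization metadata (scales, zero points), so the real int4 footprint would be somewhat larger. The function name is just illustrative.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: counts raw weight bits only, no quantization metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


OPT_175B_PARAMS = 175e9  # nominal parameter count for OPT-175B

fp16_gb = weight_footprint_gb(OPT_175B_PARAMS, 16)
int4_gb = weight_footprint_gb(OPT_175B_PARAMS, 4)

print(f"FP16: {fp16_gb:.1f} GB")
print(f"int4: {int4_gb:.1f} GB ({int4_gb / fp16_gb:.0%} of FP16)")
```

So the 350GB figure matches 2 bytes per weight, and 4 bits per weight brings the weights alone down to roughly 87.5GB.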
The fact that such coarse quantization is tolerated seems to suggest the "bottleneck" is in some other aspect of the system, and that until it is addressed, higher-fidelity quantization does not improve performance.
Or maybe it's the relative values (the ratios) between weights that matter, and as long as the intended ratio between weights can be expressed, the exact precision of the weights themselves may not be important?
I found an interesting paper on this, linked below. There is doubtless heavy research underway in this area.
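To make the "relative values" idea a bit more concrete, here is a minimal sketch of generic min-max quantization of one small group of weights to 4 bits. This is an assumption-laden toy, not FlexGen's actual scheme: the function names and the sample weights are made up for illustration. It shows that 16 levels keep the ordering of the weights and bound the rounding error by half a quantization step.

```python
def quantize_group(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using min-max scaling."""
    levels = (1 << bits) - 1          # 15 steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48, -0.02, 0.19]

codes, scale, lo = quantize_group(weights)
restored = dequantize_group(codes, scale, lo)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print(f"max error {max_err:.4f} vs half step {scale / 2:.4f}")
```

Because the scale is computed per group, a smaller group means a smaller range and therefore a smaller step, which is one plausible reason low-bit weights can still work reasonably well.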
- https://www.researchgate.net/publication/367557918_Understan...