by loremaster
1114 days ago
Very, very recently: in the past few days. I tried it out immediately, because GPTQ-for-LLaMA and hunting for or making quantized models can be tedious, but it was disappointingly slow. On a 3090 where I was getting responses from a given 13B model in 10-30 seconds, just using transformers with load_in_4bit took about ten times as long for each response.
There’s also the storage benefit of using models that are actually stored in quantized form.
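For a rough sense of the storage point, here is a back-of-the-envelope sketch. It assumes a nominal 13 billion parameters, a 16-bit checkpoint as the starting point for on-the-fly quantization, and exactly 4 bits per weight for the pre-quantized file. Real quantized files also carry scales and other metadata, so actual sizes will be somewhat larger. The helper name `weight_storage_gb` is made up for this example.

```python
def weight_storage_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: a "13B" model has a nominal 13e9 parameters.
n_params = 13e9

fp16_gb = weight_storage_gb(n_params, 16)  # 16-bit checkpoint on disk
int4_gb = weight_storage_gb(n_params, 4)   # idealized 4-bit file, no metadata

print(f"16-bit checkpoint: ~{fp16_gb:.1f} GB")
print(f"4-bit checkpoint:  ~{int4_gb:.1f} GB")
print(f"ratio:             ~{fp16_gb / int4_gb:.0f}x")
```

Under those assumptions the pre-quantized file is roughly a quarter of the size on disk, which also means less to download.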