by loremaster 1114 days ago
Very, very recently: within the past few days. I tried it out immediately, because using GPTQ-for-LLaMA and hunting for (or making) quantized models can be tedious, but it was disappointingly slow. On a 3090, where I had been getting responses from a given 13B model in 10-30 seconds, just using transformers with load_in_4bit took about ten times as long per response. There’s also the storage benefit of using models that are already quantized on disk.
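
To put a rough number on the storage point, here is a back-of-the-envelope sketch. It assumes "13B" means exactly 13 billion parameters, that the unquantized checkpoint is fp16, and it ignores the per-group scales and zero points a real quantized file also carries, so treat the figures as approximate. The helper name weight_size_gb is made up for illustration.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "13B" taken as exactly 13 billion parameters.
N_PARAMS = 13e9

# Assumption: the checkpoint loaded with load_in_4bit is stored as fp16.
fp16_gb = weight_size_gb(N_PARAMS, 16)

# A pre-quantized 4-bit file, ignoring quantization metadata overhead.
int4_gb = weight_size_gb(N_PARAMS, 4)

print(f"fp16 checkpoint:  ~{fp16_gb:.1f} GB")
print(f"4-bit checkpoint: ~{int4_gb:.1f} GB")
print(f"ratio:            ~{fp16_gb / int4_gb:.0f}x")
```

As far as I understand it, load_in_4bit quantizes at load time, so you still have to download and keep the larger file around, whereas a pre-quantized model only needs the smaller one.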