| HN Mirror



	by loremaster 1162 days ago
	Very, very recently, in the past few days. I tried it out immediately, because dealing with GPTQ-for-LLaMA and hunting for or making quantized models can be tedious, but it was disappointingly slow. On a 3090, where a given 13B model was returning responses in 10-30 seconds, plain transformers with load_in_4bit took about ten times as long per response. There’s also the storage benefit of using models that are actually quantized on disk.
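	To put a rough number on the storage point, here is a back-of-the-envelope sketch. It assumes a nominal 13e9 parameters, 16 bits per weight for an fp16 checkpoint, and 4 bits per weight for a pre-quantized one. It ignores the scales and zero-points that real 4-bit formats also store, so actual files will be somewhat larger. The function name is made up for illustration. It also assumes that load_in_4bit quantizes at load time from the full-precision checkpoint, so the download stays at the fp16 size.

```python
def checkpoint_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1e9 bytes).

    Hypothetical helper: counts only the weights themselves and ignores
    quantization metadata (scales, zero-points) and non-weight files.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "13B" taken as exactly 13e9 parameters; real checkpoints differ a bit.
N_PARAMS_13B = 13e9

fp16_gb = checkpoint_size_gb(N_PARAMS_13B, 16)  # full-precision checkpoint on disk
int4_gb = checkpoint_size_gb(N_PARAMS_13B, 4)   # pre-quantized 4-bit checkpoint

print(f"fp16:  ~{fp16_gb:.1f} GB")
print(f"4-bit: ~{int4_gb:.1f} GB")
print(f"ratio: ~{fp16_gb / int4_gb:.0f}x")
```

	So under those assumptions it is roughly 26 GB versus 6.5 GB per 13B model, before metadata overhead.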