| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by petercooper 51 days ago
	Yeah, the patched llama.cpp. The reason is I saw that using the Q4 quant on vLLM is discouraged and the int8 won't fit on my 3090 Ti, but I could certainly give it a go. I also skipped Transformers as it needs to download the full weights and quantize them locally and I didn't fancy waiting for a 50GB download.