by osanseviero 21 hours ago
Hi! What implementation are you using? Right now vLLM is the recommended one. llama.cpp support is still in an early draft.

Yeah, the patched llama.cpp. I saw that using the Q4 quant on vLLM is discouraged, and the int8 won't fit on my 3090 Ti, though I could certainly give it a go. I also skipped Transformers because it needs to download the full weights and quantize them locally, and I didn't fancy waiting for a 50GB download.
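
For a rough sense of the sizes involved, here is a back-of-the-envelope sketch. It assumes the 50GB download is 16-bit weights (an assumption, not stated in the thread) and counts weights only, ignoring KV cache, activations, and runtime overhead. The 24 GB figure is the 3090 Ti's VRAM; the function names are made up for illustration.

```python
# Weight-only memory estimate. All numbers are rough.
FULL_WEIGHTS_GB = 50       # download size mentioned in the comment
FULL_PRECISION_BITS = 16   # ASSUMPTION: full weights are stored in 16-bit
VRAM_GB = 24               # RTX 3090 Ti


def weight_size_gb(bits_per_weight: float) -> float:
    """Scale the full-precision size down to the given bit width."""
    return FULL_WEIGHTS_GB * bits_per_weight / FULL_PRECISION_BITS


def fits(bits_per_weight: float) -> bool:
    """True if the weights alone are smaller than the card's VRAM."""
    return weight_size_gb(bits_per_weight) < VRAM_GB


for name, bits in [("int8", 8), ("4-bit", 4)]:
    size = weight_size_gb(bits)
    verdict = "fits" if fits(bits) else "does not fit"
    print(f"{name}: ~{size:.1f} GB of weights, {verdict} in {VRAM_GB} GB")
```

Under that assumption the int8 weights alone come to about 25 GB, slightly over the card's 24 GB, while a 4-bit quant is around 12.5 GB and leaves room for context. Real quant formats carry some extra per-block metadata, so actual files will be somewhat larger than these figures.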