by petercooper
17 hours ago
Yeah, the patched llama.cpp. The reason is that I saw using the Q4 quant on vLLM is discouraged and the int8 won't fit on my 3090 Ti, but I could certainly give it a go. I also skipped Transformers, as it needs to download the full weights and quantize them locally, and I didn't fancy waiting for a 50GB download.
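For anyone wanting to sanity-check the "won't fit" part, here is a rough back-of-the-envelope sketch. It is not from the comment itself: the parameter count is backed out from the 50GB figure on the assumption that the full weights are stored at 16 bits each, the 4-bit figure ignores per-block scale overhead, and the 2 GiB reserved for KV cache and runtime is a made-up placeholder. The only hard number is the 3090 Ti's 24 GB of VRAM.

```python
# Rough VRAM estimate for model weights at different precisions.
# Assumptions (not stated in the comment): the 50 GB download holds
# 16-bit weights, and ~2 GiB must stay free for KV cache and runtime.

GIB = 2**30

# Assumed: 50 GB of 16-bit weights -> 2 bytes per parameter.
N_PARAMS = 50e9 / 2


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def fits(n_params: float, bits_per_weight: float,
         vram_gib: float = 24.0, overhead_gib: float = 2.0) -> bool:
    """True if weights plus the assumed overhead fit in VRAM."""
    return weight_gib(n_params, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    for label, bits in [("16-bit", 16), ("int8", 8), ("4-bit", 4)]:
        size = weight_gib(N_PARAMS, bits)
        verdict = "fits" if fits(N_PARAMS, bits) else "does not fit"
        print(f"{label}: {size:.1f} GiB of weights, {verdict} in 24 GiB")
```

Under those assumptions the int8 weights alone come out just under 24 GiB, so it is the extra room for context and runtime that pushes them over, while a 4-bit quant leaves plenty of headroom.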