by hdhshdhshdjd
706 days ago
I don’t see any indication that this beats Llama3 70B, and it still requires a beefy GPU, so I’m not sure what the use case is. I have an A6000 that I use for a lot of things; Mixtral was my go-to until Llama3 came out, then I switched over. If you could run this on, say, a stock CPU, that would increase the use cases dramatically, but if you still need a 4090, then either I’m missing something or this is useless.
|
That's without the context window, so depending on how much context you want to use, you'll need a few more GB.
That's assuming you're using llama.cpp, which is the standard for consumer inference (Ollama is built on llama.cpp, as is kobold).
This thing will run fine on a 16GB card, and a q6 quantization will run fine on a 12GB card.
You'll still get good performance on an 8GB card with offloading, since most of the model will still run on the GPU.
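
To put rough numbers on the "some more GB" for context: here's a back-of-the-envelope sketch using the usual KV cache size formula for a transformer with grouped-query attention. The layer count, KV head count, head dimension, and weight size below are made-up placeholders, not this model's real specs, so plug in the values from the model card. It also ignores compute buffers and other runtime overhead, so treat the result as a lower bound.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Each layer stores two tensors (keys and values), and each holds one
    head_dim-sized vector per KV head per token of context.
    bytes_per_elem=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def total_vram_gib(weights_gib, n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2):
    """Quantized weight size plus KV cache, in GiB (overhead not included)."""
    cache = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                           bytes_per_elem)
    return weights_gib + cache / 2**30


if __name__ == "__main__":
    # Hypothetical model: 40 layers, 8 KV heads, head_dim 128,
    # 10 GiB of quantized weights, 8192 tokens of context.
    cache_gib = kv_cache_bytes(40, 8, 128, 8192) / 2**30
    print(f"KV cache: {cache_gib:.2f} GiB")
    print(f"Total:    {total_vram_gib(10.0, 40, 8, 128, 8192):.2f} GiB")
```

The cache grows linearly with context length, so doubling the context doubles that part of the budget while the weights stay fixed. If the total doesn't fit on your card, that's where partial offloading comes in.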