|
|
|
|
|
by JohnTheNerd
889 days ago
|
|
yes, they are the 16GB models. beware that the memory bus limits you quite a bit. however, buying brand new, they are the best VRAM per dollar in the NVIDIA world as far as I could see. I use 4-bit GPTQ quants. I use tensor parallelism (vLLM supports it natively) to split the model across two GPUs, leaving me with exactly zero free VRAM. there are many reasons behind this decision (some of which are explained in the blog): - TheBloke's GPTQ quants only support 4-bit and 3-bit. since the quality difference between 3-bit and 4-bit tends to be large, I went with 4-bit. I did not test, but I wanted high accuracy for non-assistant tasks too, so I simply went with 4-bit. - vLLM only supports GPTQ, AWQ, and SqueezeLM for quantization. vLLM was needed to serve multiple clients at a time and it's very fast (I want to use the same engine for multiple tasks, this smart assistant is only one use case). I get about 17 tokens/second, which isn't great, but very functional for my needs. - I chose GPTQ over AWQ for reasons I discussed in the post, and don't know anything about SqueezeLM. |
|
3060 12gb is cheaper upfront and a viable alternative. 3090ti used is also cheaper $/vram although a power hog.
4060 16gb is a nice product, just not for gaming. I would wait for price drops because Nvidia just released the 4070 super which should drive down the cost of the 4060 16gb. I also think the 4070ti super 16gb is nice for hybrid gaming/llm usage.