by superkuh
19 days ago
>consumer-grade card with 12G of VRAM and got 5t/s

That token output speed suggests to me that it is running in hybrid mode and involving the CPU and system RAM. ~5 tk/s is about what DDR4 RAM bandwidth allows for a model of that size at 4-bit. Any consumer GPU with 12 GB, like an NVIDIA RTX 2080 or RTX 3060, should be doing 20+ tk/s with llama.cpp and the CUDA backend.
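For anyone who wants to check the bandwidth argument, here is a rough back-of-the-envelope sketch. It assumes decoding is memory-bound, so every weight is read once per generated token and the ceiling is simply bandwidth divided by model size. The numbers are assumptions, not measurements from this thread: 51.2 GB/s is the theoretical peak of dual-channel DDR4-3200, 360 GB/s is the spec-sheet bandwidth of the 12 GB RTX 3060, and the 20B parameter count is a hypothetical stand-in because the actual model size is not given here. The function name is made up for illustration.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             params_billion: float,
                             bits_per_weight: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model.

    Assumes all weights are streamed from memory once per token and
    ignores KV cache traffic, compute time, and other overhead.
    """
    # 1e9 parameters * (bits / 8) bytes each = size in GB
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


# Assumed figures (not taken from the thread):
DDR4_DUAL_CHANNEL_GB_S = 51.2  # DDR4-3200: 3200 MT/s * 8 bytes * 2 channels
RTX_3060_GB_S = 360.0          # spec-sheet bandwidth, 12 GB RTX 3060
PARAMS_BILLION = 20.0          # hypothetical model size
BITS_PER_WEIGHT = 4.0          # 4-bit quantization

cpu_rate = decode_tokens_per_second(DDR4_DUAL_CHANNEL_GB_S, PARAMS_BILLION, BITS_PER_WEIGHT)
gpu_rate = decode_tokens_per_second(RTX_3060_GB_S, PARAMS_BILLION, BITS_PER_WEIGHT)

print(f"system RAM: ~{cpu_rate:.1f} tk/s, GPU VRAM: ~{gpu_rate:.1f} tk/s")
```

Under those assumptions the system-RAM ceiling comes out near 5 tk/s and the VRAM ceiling well above 20 tk/s, which is the gap described above. Real throughput will be lower than either ceiling.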
I should play a bit more with llama.cpp options and see what happened there. Thanks!