| HN Mirror



	by superkuh 19 days ago
	> consumer-grade card with 12G of VRAM and got 5t/s

	That token output speed suggests to me that it is running in hybrid mode, with the CPU and system RAM involved. ~5 tk/s is about what DDR4 RAM bandwidth allows for a model of that size at 4-bit. Any consumer GPU with 12 GB, like an NVIDIA RTX 2080 or RTX 3060, should be doing 20+ tk/s with llama.cpp and the CUDA backend.
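The bandwidth argument can be sketched as a back-of-the-envelope calculation: during decoding, each generated token reads roughly all of the weights once, so tokens per second is bounded by memory bandwidth divided by model size. The figures below are assumptions, not numbers from the thread: dual-channel DDR4-3200, the published 360 GB/s of the 12 GB RTX 3060, and a hypothetical 10 GB weights file for the 4-bit model.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed when every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Assumed figures, not taken from the thread.
# Dual-channel DDR4-3200: 2 channels x 8 bytes x 3200 MT/s = 51.2 GB/s.
DDR4_3200_DUAL_CHANNEL_GB_S = 2 * 8 * 3200 / 1000
RTX_3060_GB_S = 360.0  # published memory bandwidth of the 12 GB RTX 3060
MODEL_SIZE_GB = 10.0   # hypothetical size of the 4-bit weights

cpu_bound = max_tokens_per_second(DDR4_3200_DUAL_CHANNEL_GB_S, MODEL_SIZE_GB)
gpu_bound = max_tokens_per_second(RTX_3060_GB_S, MODEL_SIZE_GB)

print(f"system RAM bound: {cpu_bound:.1f} tok/s")
print(f"VRAM bound:       {gpu_bound:.1f} tok/s")
```

Under these assumptions the system RAM bound lands near 5 tok/s, while the VRAM bound is several times higher. Real throughput sits below both bounds.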

2 comments

senko 19 days ago

Good catch. I haven't looked deeply into it. This is with the Vulkan backend on Linux, which I understand should be roughly comparable to CUDA? The GPU is an RTX 3060 (Ti?).

I should play a bit more with llama.cpp options and see what happened there. Thanks!


superkuh 19 days ago

I've had it happen in the past with llama.cpp on Linux that the CPU presents itself as a Vulkan device, GPU1, with "PHYSICAL_DEVICE_TYPE_CPU", and the devices got mixed up. You might want to try llama-server --list-devices and then append --device Vulkan0 or whatever.
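A minimal sketch of that two-step check, written as a small Python helper that only builds the command lines. The model path `model.gguf` is a placeholder, `Vulkan0` is just the example name from the comment above (use whatever the device listing actually shows), and `-ngl 99` is an assumed setting to request offloading of all layers.

```python
import shlex


def build_llama_server_cmd(model_path: str, device: str, gpu_layers: int = 99) -> list[str]:
    """Build a llama-server command line pinned to one named device."""
    return ["llama-server", "-m", model_path, "--device", device, "-ngl", str(gpu_layers)]


# Step 1: list the devices llama.cpp can see, to find the real GPU's name.
LIST_DEVICES_CMD = ["llama-server", "--list-devices"]

# Step 2: start the server on the chosen device.
# "model.gguf" and "Vulkan0" are placeholders, not values confirmed by the thread.
cmd = build_llama_server_cmd("model.gguf", "Vulkan0")

print(shlex.join(LIST_DEVICES_CMD))
print(shlex.join(cmd))
```

The printed lines can be pasted into a shell, or the lists can be passed to `subprocess.run` directly.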


pja 18 days ago

The 8-bit quant runs at 36 tps using Vulkan on my AMD RX 9070.
