by cpburns2009
128 days ago
In my brief testing on a different MoE model, it does not perform well. I tried running Kimi-K2-Thinking-GGUF with the smallest Unsloth quantization (UD-TQ1_0, 247 GB), and it ran at 0.1 tps. According to its guide, you should expect 5 tps if the whole model fits into RAM+VRAM, but less than 1 tps if mmap has to be used, which matches my test. This was on a Ryzen AI Max+ 395 using ~100 GB VRAM.
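For anyone who wants to sanity-check their own setup, the fit-or-spill arithmetic behind that result can be sketched roughly as below. This is only a back-of-the-envelope check, not anything from the guide: the function names are made up, and the 28 GB system RAM figure is an assumption (a 128 GB machine with ~100 GB assigned as VRAM). It also ignores KV cache and OS overhead, so the real shortfall would be larger.

```python
def fits_in_memory(model_gb: float, ram_gb: float, vram_gb: float) -> bool:
    """Return True if the model weights fit entirely in RAM + VRAM."""
    return model_gb <= ram_gb + vram_gb


def spill_gb(model_gb: float, ram_gb: float, vram_gb: float) -> float:
    """Return how many GB would have to be paged from disk via mmap."""
    return max(0.0, model_gb - (ram_gb + vram_gb))


# Figures from the comment: 247 GB quant, ~100 GB VRAM.
# ASSUMPTION: 28 GB of system RAM left over (hypothetical 128 GB machine).
model_gb, vram_gb, ram_gb = 247.0, 100.0, 28.0

if fits_in_memory(model_gb, ram_gb, vram_gb):
    print("Model fits in RAM+VRAM")
else:
    shortfall = spill_gb(model_gb, ram_gb, vram_gb)
    print(f"Model does not fit: ~{shortfall:.0f} GB must be read via mmap")
```

Under those assumed numbers, close to half the weights would have to come off disk, which is consistent with landing in the sub-1 tps range.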