by kpw94
3 days ago
> About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac

Curious if you can share the prefill speed too?

I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models, and around 150 tok/s prefill speed.

The reason I'm asking about prefill: looking at my stats at work, I use between 20M and peaks of 300M input tokens daily. Some of those tokens are cached, but in general I seem to have roughly 500x more input tokens than output. So I'm interested in prefill tok/s stats.

Huge thank you for llama.cpp btw!!
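To put rough numbers on why prefill matters at that ratio, here is a back-of-envelope sketch. It assumes the figures quoted above (150 tok/s prefill, 15 tok/s generation, 500 input tokens per output token) and ignores prompt caching and batching entirely, so treat it as an upper bound on prefill cost rather than a measurement. The function names are made up for illustration.

```python
def prefill_share(input_tokens, output_tokens, prefill_tps, gen_tps):
    """Fraction of wall time spent in prefill, assuming no caching or batching."""
    prefill_s = input_tokens / prefill_tps
    gen_s = output_tokens / gen_tps
    return prefill_s / (prefill_s + gen_s)


def prefill_hours(input_tokens, prefill_tps):
    """Hours needed just to prefill the given number of input tokens."""
    return input_tokens / prefill_tps / 3600


# 500 input tokens per output token, at the speeds from the comment above.
share = prefill_share(500, 1, prefill_tps=150, gen_tps=15)

# Daily input volumes from the comment: 20M typical, 300M at peak.
low_hours = prefill_hours(20_000_000, 150)
peak_hours = prefill_hours(300_000_000, 150)

print(f"prefill share of wall time: {share:.1%}")
print(f"20M input tokens:  {low_hours:.0f} h of prefill")
print(f"300M input tokens: {peak_hours:.0f} h of prefill")
```

Under these assumptions prefill is about 98% of the wall time, and even the 20M-token day would need roughly 37 hours of uncached prefill on this machine, which is why the prefill number matters more than generation speed for this workload.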
|
Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.
[0] https://github.com/ggml-org/llama.cpp/pull/19164
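For anyone unfamiliar with the idea, here is a toy sketch of n-gram lookup drafting. It is not the code from the PR above and does not use any llama.cpp API; the function name `ngram_draft` and its parameters are hypothetical. The assumption is only the general technique: look for an earlier occurrence of the last few tokens in the context and propose whatever followed it as a cheap draft, which the real model then verifies in one batch.

```python
def ngram_draft(tokens, n=3, max_draft=8):
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Returns up to max_draft tokens that followed the most recent earlier
    occurrence of the last n tokens, or [] if there is no such occurrence.
    """
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards over earlier start positions, skipping the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []


# Repetitive contexts (code edits, boilerplate) are where this tends to pay off.
context = "a b c d e a b c".split()
draft = ngram_draft(context, n=3, max_draft=2)
print(draft)
```

The draft is only a guess. In speculative decoding the target model still checks the proposed tokens and keeps the prefix it agrees with, so the output is unchanged and only the speed varies.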