by ozgrakkurt 50 days ago
Did you try GPU/CPU mix with a bigger model?
1 comment

Prompt processing is absolutely punishing, about 18.5 t/s on a 1000-token prompt:

    ./llama-batched-bench -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ4_NL -npp 1000 -ntg 128 -npl 1 --cache-type-k q8_0 --cache-type-v q8_0 -c 18000 --n-cpu-moe 32
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |   53.961 |    18.53 |    9.223 |    13.88 |   63.184 |    17.85 |
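
For anyone reading the table: the speed columns are just token counts divided by the timings, which a quick sketch can confirm. The numbers below are copied from the row above. The last figure is my own rough extrapolation to the full `-c 18000` context, assuming prompt speed stays flat as the context grows. That was not measured here.

```python
# Figures copied from the llama-batched-bench row above.
pp_tokens = 1000   # PP: prompt tokens processed
tg_tokens = 128    # TG: tokens generated
t_pp = 53.961      # T_PP, seconds spent on prompt processing
t_tg = 9.223       # T_TG, seconds spent on generation

s_pp = pp_tokens / t_pp                    # prompt processing speed, t/s
s_tg = tg_tokens / t_tg                    # generation speed, t/s
t_total = t_pp + t_tg                      # total wall time, seconds
s_total = (pp_tokens + tg_tokens) / t_total

print(f"S_PP = {s_pp:.2f} t/s")
print(f"S_TG = {s_tg:.2f} t/s")
print(f"T    = {t_total:.3f} s, S = {s_total:.2f} t/s")

# ASSUMPTION (not measured): prompt speed stays constant up to the
# 18000-token context passed via -c. This is only a rough estimate.
full_ctx_tokens = 18000
est_fill_seconds = full_ctx_tokens / s_pp
print(f"Estimated time to process {full_ctx_tokens} prompt tokens: "
      f"{est_fill_seconds / 60:.1f} min")
```

Under that assumption, a prompt that fills the whole context would take a bit over 16 minutes before the first generated token.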