by coder543 166 days ago
Even though big, dense models aren't fashionable anymore, they are a great fit for speculative decoding (specdec), so it can be fun to see how much speedup is possible.

I can get about 20 tokens per second on the DGX Spark using llama-3.3-70B with no loss in quality compared to the model you were benchmarking:

    llama-server \
        --model      llama-3.3-70b-instruct-ud-q4_k_xl.gguf \
        --model-draft llama-3.2-1b-instruct-ud-q8_k_xl.gguf \
        --ctx-size      80000 \
        --ctx-size-draft 4096 \
        --draft-min 1 \
        --draft-max 8 \
        --draft-p-min 0.65 \
        -ngl 999 \
        --flash-attn on \
        --parallel 1 \
        --no-mmap \
        --jinja \
        --temp 0.0 \
        -fit off

Specdec works well for code, so the prompt I used was "Write a React TypeScript demo".

    prompt eval time = 313.70 ms / 40 tokens (7.84 ms per token, 127.51 tokens per second)
    eval time = 46278.35 ms / 913 tokens (50.69 ms per token, 19.73 tokens per second)
    total time = 46592.05 ms / 953 tokens
    draft acceptance rate = 0.87616 (757 accepted / 864 generated)
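
As a sanity check, the figures in that log can be recomputed by hand. This is only a sketch: the variable names are mine, and the "tokens per main-model pass" figure is a rough estimate that assumes every generated token that was not an accepted draft token came from exactly one pass of the main model, which the log itself does not state.

```python
# Figures copied from the llama-server log above.
prompt_ms, prompt_tokens = 313.70, 40
eval_ms, eval_tokens = 46278.35, 913
accepted, drafted = 757, 864


def tokens_per_second(ms: float, tokens: int) -> float:
    """Convert a (milliseconds, token count) pair into tokens per second."""
    return tokens / (ms / 1000.0)


prompt_tps = tokens_per_second(prompt_ms, prompt_tokens)
eval_tps = tokens_per_second(eval_ms, eval_tokens)
acceptance = accepted / drafted

# ASSUMPTION (not in the log): each token that was not an accepted draft
# token was produced by one main-model pass, so the number of such tokens
# approximates the number of passes the 70B model had to run.
main_model_tokens = eval_tokens - accepted
tokens_per_main_pass = eval_tokens / main_model_tokens

if __name__ == "__main__":
    print(f"prompt: {prompt_tps:.2f} tok/s")
    print(f"eval:   {eval_tps:.2f} tok/s")
    print(f"draft acceptance: {acceptance:.5f}")
    print(f"estimated tokens per main-model pass: {tokens_per_main_pass:.2f}")
```

If that assumption holds, most of the speedup comes from the 70B model emitting several tokens per pass instead of one.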

The draft model cannot affect the quality of the output, because the main model verifies every drafted token and keeps only the ones it agrees with. A good draft model makes token generation faster and a bad one slows it down, but the quality will be the same as the main model either way.
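
To make that concrete, here is a toy sketch of greedy speculative decoding (the `--temp 0.0` case). It is not how llama.cpp implements it: `main_model`, `good_draft`, and `bad_draft` are hypothetical stand-in functions, and a real engine verifies the whole draft in one batched forward pass rather than token by token. The point is only that every kept token is one the main model would have picked itself.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # greedy "next token" function


def main_model(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large model: deterministic next token."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def good_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft that agrees with the main model most of the time."""
    tok = main_model(ctx)
    return tok if len(ctx) % 5 else (tok + 1) % 50


def bad_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft that almost never agrees with the main model."""
    return 0


def greedy_decode(model: Model, prompt: List[Token], n: int) -> List[Token]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    main: Model, draft: Model, prompt: List[Token], n: int, draft_max: int = 8
) -> Tuple[List[Token], int]:
    """Return (generated tokens, number of main-model verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # The draft model proposes up to draft_max tokens on its own.
        proposal: List[Token] = []
        for _ in range(draft_max):
            proposal.append(draft(out + proposal))
        # One verification pass of the main model (batched in a real engine).
        passes += 1
        for tok in proposal:
            expected = main(out)
            if tok != expected:
                out.append(expected)  # reject the draft, keep the main model's token
                break
            out.append(tok)  # accepted: identical to the main model's choice
        else:
            out.append(main(out))  # whole draft accepted, plus one extra token
    return out[len(prompt):len(prompt) + n], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = greedy_decode(main_model, prompt, 40)
    for name, draft in (("good", good_draft), ("bad", bad_draft)):
        tokens, passes = speculative_decode(main_model, draft, prompt, 40)
        print(name, "draft: same output =", tokens == reference, "| passes =", passes)
```

In this toy, both drafts reproduce the plain greedy output exactly; only the number of main-model passes changes, which is where the speed difference comes from.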
1 comments

thanks