by fy20
45 days ago
Running it on a MacBook Pro M5 48GB:
-hf unsloth/Qwen3.6-27B-GGUF:UD-Q6_K_XL \
-c 128000 \
--parallel 1 \
--flash-attn on \
--no-context-shift \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence_penalty 0.0 \
--reasoning on \
--jinja \
--chat-template-kwargs "{\"preserve_thinking\": true}" \
--spec-type ngram-simple \
--draft-max 64 \
--timeout 1800
Does anyone have tips for optimising prompt processing? That's the slowest part: it takes a few minutes before OpenCode, with ~20k initial context, first responds. Subsequent responses are pretty fast due to caching.
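For a rough sense of scale, here is a back-of-envelope sketch of what those numbers imply. It assumes the ~20k figure is in tokens and reads "a few minutes" as 180 seconds, which is a guess rather than a measurement. The 500-token follow-up and the function names are hypothetical, chosen only for illustration.

```python
def implied_rate(prompt_tokens: int, seconds: float) -> float:
    """Prompt-processing (prefill) throughput in tokens per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return prompt_tokens / seconds


def prefill_seconds(prompt_tokens: int, cached_tokens: int, rate: float) -> float:
    """Time to process only the part of the prompt not already in the cache."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / rate


# Assumed inputs: 20,000 prompt tokens, 180 s until the first response.
rate = implied_rate(20_000, 180.0)

# First request: nothing cached, so the whole prompt is processed.
first = prefill_seconds(20_000, 0, rate)

# Hypothetical follow-up: 500 new tokens on top of a fully cached 20,000-token prefix.
follow_up = prefill_seconds(20_500, 20_000, rate)

print(f"implied prefill rate: {rate:.0f} tok/s")
print(f"first response prefill: {first:.0f} s")
print(f"follow-up prefill: {follow_up:.1f} s")
```

Under those assumptions the prefill rate works out to roughly 111 tokens/sec, and a follow-up that only adds a few hundred tokens needs seconds rather than minutes. That matches the pattern described above: a slow first response, then fast ones once the prefix is cached.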
Note: the 27B is going to be slow; use the 35B MoE if you want decent tokens/sec.