by robotswantdata
82 days ago
Feels 100% vibe coded in a bad way. Llama.cpp already has KV cache compression, and one of the turbo quant PRs will get merged at some point. If you don’t care about the fancy 3-bit, the q8 KV compression is good enough! Don’t bother with q4.

  ./build/bin/llama-server -m model.gguf \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    -c 65536

Etc.

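For a rough sense of what the q8 cache buys at `-c 65536`, here is a back-of-the-envelope sketch. The block sizes are ggml's (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption for illustration, not something taken from the thread, and `kv_cache_bytes` is a hypothetical helper, not a llama.cpp API.

```python
# Bytes needed to store one block of 32 values in each ggml cache type.
# f16: 32 * 2 bytes; q8_0: 2-byte scale + 32 int8; q4_0: 2-byte scale + 16 bytes.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate KV cache size in bytes (hypothetical helper, K and V use the same type)."""
    # Factor 2 covers both the K and the V tensors.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BLOCK_BYTES[cache_type] // BLOCK_SIZE


if __name__ == "__main__":
    # Assumed model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=65536)
    for cache_type in ("f16", "q8_0", "q4_0"):
        gib = kv_cache_bytes(cache_type=cache_type, **shape) / 1024**3
        print(f"{cache_type}: {gib:.2f} GiB")
```

Under those assumptions the estimate is about 8 GiB for f16, 4.25 GiB for q8_0 and 2.25 GiB for q4_0, so q8_0 already removes close to half of the cache. Real usage will differ with the actual model and any extra buffers.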
The benchmark shows a benefit for the MLX engine, so it's the user's choice which engine to use; aegis-ai supports both :)