by mhitza
68 days ago
It's a MoE model, and the A3B stands for 3 billion active parameters, like the recent Gemma 4. You can try offloading the experts to the CPU with llama.cpp (--cpu-moe), which should free up quite a bit of GPU memory for extra context, at the cost of lower token generation speed.
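For anyone who wants to try it, here is a rough sketch of what the invocation might look like. The model path and the context size are placeholders I made up for illustration, so substitute your own; only the llama.cpp flags themselves (-m, -ngl, --cpu-moe, -c) are real. The snippet just builds and prints the command so you can review it before running.

```shell
# Hypothetical path: substitute your own GGUF file for the A3B model.
MODEL="./model-a3b.gguf"

# -ngl 99    : request up to 99 layers on the GPU (in practice, all of them)
# --cpu-moe  : keep the MoE expert weights in system RAM instead of VRAM
# -c 32768   : context size; an illustrative value, not a recommendation
CMD="llama-server -m $MODEL -ngl 99 --cpu-moe -c 32768"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

If generation ends up too slow, recent builds also have --n-cpu-moe N, which as far as I know keeps only the experts of the first N layers on the CPU, so you can trade some of that freed memory back for speed.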