by lambda
123 days ago
OK, with MiniMax M2.5 UD-Q3_K_XL (101 GiB), I can't seem to fit the full context, even at smaller quants. Going much above 64k tokens, I start to get OOM errors when running Firefox and Zed alongside the model, or just failures to allocate the buffers, even going down to 4-bit KV cache quants (oddly, 8-bit worked better than 4- or 5-bit, but I still ran into OOM errors). I might be able to squeeze a bit more out if I were running fully headless with my development on another machine, but I'm running everything on a single laptop. So it looks like, for my setup, 64k context with an 8-bit KV cache quant is about as good as I can do, and I need to drop down to a smaller model like Qwen3 Coder Next or GPT-OSS 120B if I want to use longer contexts.
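For a rough sense of why context length runs into memory limits, here is a back-of-the-envelope sketch of KV cache size as a function of context length and cache type. The layer count, KV head count, and head dimension are hypothetical placeholders, not MiniMax M2.5's actual configuration, and the function name `kv_cache_gib` is made up for illustration. The bits-per-element figures are the effective sizes of the ggml block formats, including per-block overhead.

```python
# Rough KV cache size estimate for a transformer with grouped-query attention.
# The model dimensions used in the demo loop are HYPOTHETICAL placeholders,
# not the real MiniMax M2.5 configuration.

# Effective bits per cached element, including per-block overhead
# (for example, q8_0 stores 32 int8 values plus one fp16 scale = 34 bytes).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Return the approximate combined size of the K and V caches in GiB."""
    # Factor of 2: one K tensor and one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30


if __name__ == "__main__":
    for n_ctx in (65536, 131072, 196608):
        for cache_type in ("f16", "q8_0", "q4_0"):
            size = kv_cache_gib(
                n_layers=60,      # hypothetical
                n_kv_heads=8,     # hypothetical
                head_dim=128,     # hypothetical
                n_ctx=n_ctx,
                cache_type=cache_type,
            )
            print(f"ctx={n_ctx:>6} {cache_type:>5}: {size:6.2f} GiB")
```

This only counts the cache itself. Compute buffers, the model weights, and whatever else is running on the machine come on top, so the real ceiling is lower than the estimate suggests.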
I haven't tried things like switching between Vulkan and ROCm yet.
But anyhow, that 17 tokens per second was on an almost empty context. By the time I got to 30k tokens of context or so, it was down to 5-10 tokens per second, and occasionally all the way down to 2 tokens per second.
Oh, and it looks like I'm sometimes filling up the KV cache, which forces it to drop the cache and start over fresh. Yikes, that is why it's getting so slow.
Qwen3 Coder Next is much faster. MiniMax's thinking/planning seems stronger, but Qwen3 Coder Next is pretty good at just cranking through a bunch of tool calls, poking around through code and docs, and just doing stuff. Also, MiniMax seems to have gotten confused by a few things while browsing around my project that Qwen3 Coder Next picked up on, so it's not like it's universally stronger.