On 96 GB I can give up to about 88 GB to the GPU with sysctl iogpu.wired_limit_mb=88000 without any ill effects. When I push it higher I tend to notice e.g. graphics driver errors, YouTube pages not working, and other semi-random glitches. So the ~80 GB DS4-flash quants should just about fit, leaving a bit extra for the KV caches. Will try. I'm curious how DS4 degrades as context depth grows, i.e. how fast tok/s drops. E.g. the lowest 2-bit quant of MiniMax-M2.6 runs, but it starts at low tok/s and degrades fast with context depth.
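
As a back-of-the-envelope check of that budget, here is a small Python sketch. The numbers are the ones quoted above. Treating 1 GB as 1000 MB (to match the 88000 value) and the function name are assumptions made just for illustration.

```python
def kv_headroom_mb(wired_limit_mb: int, model_gb: float) -> float:
    """MB left under the GPU wired limit once the model weights are loaded."""
    return wired_limit_mb - model_gb * 1000  # assumption: 1 GB = 1000 MB


total_mb = 96 * 1000       # machine RAM
wired_limit_mb = 88000     # value passed to sysctl iogpu.wired_limit_mb
model_gb = 80              # approximate size of the DS4-flash quants

# What remains on the GPU side for KV caches and compute buffers.
print("GPU headroom (MB):", kv_headroom_mb(wired_limit_mb, model_gb))
# What remains for macOS and everything else.
print("System headroom (MB):", total_mb - wired_limit_mb)
```

Both come out at roughly 8 GB, which is why the fit is "just about": any growth in KV cache with context depth eats directly into that margin.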

The biggest models I can comfortably run are about half the DS4F size, like gpt-oss-120b. Lately I've been toying with Ling-2.6-flash. I got the agents to adapt existing Metal kernels in llama.cpp, and it did run (model https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, branch https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas...). It's 104B-A7B4, and for the M2 Max 7.4B active parameters is about the most it can take while still producing 40 tok/s. The hybrid arch also allows for graceful degradation: still close to 30 tok/s at 64K context depth.

Too bad that L2.6F, while the best I have, is not that much better in agentic benchmarks than my current incumbent local LLM (nemotron-cascade-2). I got inspired by DS4 to start an l26f branch (WIP https://github.com/ljubomirj/l26f). :-) The aim is to squeeze the most out of L2.6F. There should be low-hanging fruit in a good integration of the agent and the inference engine. On input: exploiting the huge cost difference between cached and non-cached tokens. On output: exploiting the fact that the NN gives us the complete set of logits over the whole 200K+ token vocabulary.
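
To make those two ideas concrete, here is a minimal Python sketch. It is not what the l26f branch does; the function names and the toy token ids are made up for illustration. The input side measures how much of a new prompt matches what is already cached, so only the tail needs evaluating. The output side uses the full logits vector to pick the best token from a restricted set, e.g. the tokens that are valid at this point in a tool call.

```python
def reusable_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Number of leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def pick_allowed(logits: list[float], allowed: set[int]) -> int:
    """Highest-logit token id among the allowed ids (a simple constrained argmax)."""
    return max(allowed, key=lambda t: logits[t])


# Input side: only tokens after the shared prefix need a fresh forward pass.
cached = [1, 2, 3, 4]          # toy token ids already in the KV cache
prompt = [1, 2, 9, 4, 5]       # toy token ids of the next request
shared = reusable_prefix_len(cached, prompt)
print("reused:", shared, "to evaluate:", len(prompt) - shared)

# Output side: the model scores every vocabulary entry, so the agent can
# restrict the choice to tokens it knows are valid at this step.
logits = [0.1, 2.0, -1.0, 1.5]  # toy 4-token vocabulary
print("chosen:", pick_allowed(logits, {0, 3}))
```

An agent that keeps its prompts append-only (stable system prompt and tool definitions first, new turns at the end) maximizes the shared prefix, which is where the cached v.s. non-cached difference pays off.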