|
|
|
|
|
by robbru
408 days ago
|
|
Interesting benchmarks, thanks for sharing! If you're optimizing for lower power draw + higher throughput on Mac (especially in MLX), definitely keep an eye on the Desloth LLMs that are starting to appear. Desloth models are basically aggressively distilled and QAT-optimized versions of larger instruction models (think: 7B → 1.3B or 2B) designed specifically for high tokens/sec at minimal VRAM. They're tiny but surprisingly capable for structured outputs, fast completions, and lightweight agent pipelines. I'm seeing Desloth-tier models consistently hit >50 tok/sec on M1/M2 hardware without needing active cooling ramps, especially when combined with low-bit quant like Q4_K_M or Q5_0. If you care about runtime efficiency per watt + low-latency inference (vs. maximum capability), these newer Desloth styled architectures are going to be a serious unlock. |
|