

by phire 482 days ago

BTW, I'd love to see a large model designed from scratch for efficient local inference on low-memory devices.

While current MoE implementations are tuned for load-balancing over large pools of GPUs, there is nothing stopping you from tuning them to switch experts only once or twice per token, and ideally to keep the same weights loaded across multiple tokens.

Well, nothing is stopping you, but there is the question of whether it would actually produce a worthwhile model.
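
To make the "switch rarely" idea concrete, here is a minimal sketch of a router with hysteresis: it keeps the currently loaded expert unless another expert beats it by a margin. This is only an illustration. The function names, the `margin` parameter, and the toy scores are all made up, and a real model would presumably need to be trained with a constraint like this rather than having it bolted on at inference time.

```python
def sticky_route(scores_per_token, margin=0.5):
    """Pick one expert per token, switching only when another expert
    beats the currently loaded one by more than `margin` (hypothetical knob)."""
    chosen = []
    current = None
    for scores in scores_per_token:
        best = max(range(len(scores)), key=lambda i: scores[i])
        if current is None or scores[best] - scores[current] > margin:
            current = best  # a switch: new expert weights would be loaded here
        chosen.append(current)
    return chosen


def count_switches(chosen):
    """Number of times the selected expert changes between adjacent tokens."""
    return sum(1 for a, b in zip(chosen, chosen[1:]) if a != b)


# Toy router scores for 5 tokens over 3 experts (made-up numbers).
scores = [
    [1.0, 0.2, 0.1],
    [0.9, 1.0, 0.1],
    [0.8, 1.1, 0.2],
    [0.1, 0.3, 1.5],
    [0.2, 0.4, 1.4],
]
greedy = sticky_route(scores, margin=0.0)  # switches whenever another expert scores higher
sticky = sticky_route(scores, margin=0.5)  # tolerates small score differences
print(greedy, count_switches(greedy))
print(sticky, count_switches(sticky))
```

With a larger margin the router trades some routing accuracy for fewer weight swaps, which is exactly the trade-off that might or might not yield a worthwhile model.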

3 comments

regularfry 482 days ago

Intuitively it feels like there ought to be significant similarities between expert layers, because there are fundamentals about processing the stream of tokens that must be shared just from the geometry of the problem. If that's true, then identifying a common abstract base "expert" and specialising the individual experts as low-rank adaptations on top of that base would mean you could save a lot of VRAM and cut down on expert-swapping. But it might mean you need to train from the start with that structure, rather than it being something you can distil to.
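
A back-of-the-envelope sketch of the potential saving, assuming each expert is a two-matrix feed-forward block and each specialisation is a pair of low-rank factors per matrix. The sizes below (`d_model`, `d_ff`, `n_experts`, `rank`) are hypothetical and not taken from any real model, and this says nothing about whether quality would hold up.

```python
def expert_params(d_model, d_ff, n_experts, rank=None):
    """Parameter count for the expert layers of one MoE block.

    rank=None: every expert stores its own full up- and down-projection.
    rank=r:    one shared base expert, plus per-expert low-rank factors
               (d_out x r and r x d_in) for each of the two matrices.
    """
    full = 2 * d_model * d_ff  # up-projection + down-projection
    if rank is None:
        return n_experts * full
    low_rank = 2 * rank * (d_model + d_ff)  # factors for both matrices
    return full + n_experts * low_rank


# Hypothetical sizes, chosen only for illustration.
independent = expert_params(d_model=1024, d_ff=4096, n_experts=8)
shared_base = expert_params(d_model=1024, d_ff=4096, n_experts=8, rank=16)
print(independent, shared_base, independent / shared_base)
```

For these made-up sizes the base-plus-adapters layout needs roughly a seventh of the parameters, and swapping an expert would mean loading only the small low-rank factors.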


phire 481 days ago

Yes, DeepSeek introduced this optimisation of a common base "expert" that's always loaded. Llama 4 uses it too.
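
Roughly, the always-loaded expert works like the toy sketch below: the shared expert runs for every token, and only the top-scoring routed experts are added on top. This is a simplified illustration using scalar "experts", not DeepSeek's or Llama's actual code. `moe_with_shared`, the lambdas, and the logits are invented for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def moe_with_shared(x, shared, routed, router_logits, top_k=1):
    """Shared expert is always evaluated; only top_k routed experts run."""
    gates = softmax(router_logits)
    top = sorted(range(len(routed)), key=lambda i: gates[i], reverse=True)[:top_k]
    out = shared(x)  # always resident in memory, applied to every token
    for i in top:
        out += gates[i] * routed[i](x)  # only these weights need loading
    return out, top


# Toy scalar "experts" and router logits (made-up values).
shared = lambda x: 2 * x
routed = [lambda x: x + 1, lambda x: x * x, lambda x: -x]
out, picked = moe_with_shared(3.0, shared, routed, router_logits=[0.0, 2.0, 1.0])
print(out, picked)
```

Note this differs from the low-rank idea above: the routed experts here are still full, independent experts, and the shared one just absorbs whatever is common to all tokens.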


regularfry 481 days ago

I had a sneaking suspicion that I wouldn't be the first to think of it.


boroboro4 482 days ago

DeepSeek introduced a novel expert training technique which increased expert specialization. For a given domain, their implementation tends to activate the same experts across different tokens, which is kinda what you're asking for!


jumski 482 days ago

I think Gemma 3 is marketed for single-GPU setups: https://blog.google/technology/developers/gemma-3/
