| HN Mirror



	by jmward01 88 days ago
	I think this shows a shift in model architecture. MoE and similar designs need more memory relative to the available compute than one big dense model with a lot of layers and weights. I think this trend is likely to accelerate: once that trade-off is built in, it encourages even more experts, which pushes the trade-off further, which means more experts, and so on.


zozbot234 88 days ago

Most people doing local inference run the MoE layers on CPU anyway, because decode is not compute-bound and wasting high-bandwidth VRAM on unused weights is silly. It's better to use the VRAM for longer context. Recent architectures even offload the MoE experts to fast NVMe (PCIe 5.0 x4 or similar performance): it's slow, but it opens up running even SOTA local MoE models on ordinary hardware.
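
A back-of-the-envelope sketch of the bandwidth argument above: if decode is limited by how fast the active weights can be read, an upper bound on tokens per second is bandwidth divided by bytes read per token. The function name, the model size, the quantization, and the VRAM and system RAM bandwidths below are all illustrative assumptions, not measurements; only the PCIe 5.0 x4 figure (roughly 16 GB/s theoretical peak) comes from the link spec.

```python
def decode_tokens_per_second(active_params: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every active weight is read once per token.

    Ignores compute, caching of hot experts, and KV-cache traffic.
    """
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical MoE model: 10B active expert parameters per token, 4-bit weights.
ACTIVE_EXPERT_PARAMS = 10e9
BYTES_PER_PARAM = 0.5

# Assumed round-number bandwidths in GB/s for each storage tier.
TIERS = {
    "VRAM (assumed)": 1000.0,
    "system RAM (assumed)": 80.0,
    "NVMe over PCIe 5.0 x4 (theoretical peak)": 16.0,
}

for tier, bandwidth in TIERS.items():
    rate = decode_tokens_per_second(ACTIVE_EXPERT_PARAMS, BYTES_PER_PARAM, bandwidth)
    print(f"{tier}: at most {rate:.1f} tokens/s")
```

Real throughput will be lower than these bounds, but the ratio between tiers shows why NVMe offload is "slow but possible" while system RAM stays usable for decode.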


jmward01 88 days ago

I think you are making my point. Having slightly slower, but a lot more, memory on the card would speed this use case up a lot. It would remove the need to go to system memory, or leave system memory available for very rarely used experts, allowing even larger MoE models to run with good performance.


zozbot234 88 days ago

I think speeding up long context and opening up the use of models with larger shared layers is ultimately more relevant than hosting unused MoE layers. Of course you could do that as a last resort, i.e. when running with a smaller context that leaves some VRAM free to use.


jmward01 88 days ago

Long context will be solved and capped and turned into a Θ(1) operation or, at worst, Θ(log n). People don't have infinite perfect recall, so agents don't need it. Also, there are really good solutions to it that just aren't explored enough right now, since transformer architectures are where everyone is dumping money and time. I suspect very soon someone will have a much better system that just takes over, and then the idea of context limits will be a thing of the past. I've actually built something myself that allows infinite context/perfect recall in Θ(1) (with a minor asterisk, as there has to be, but meh). I know others have solutions too.
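
To make "constant cost per token" concrete, here is a toy fixed-size associative memory in the style of a linear-attention state. It is not the commenter's system, which is not described in the thread; the class name and the outer-product update rule are assumptions chosen for illustration. The point is only that the state has a fixed shape, so a write or read costs the same no matter how many tokens came before.

```python
from typing import List


class ConstantStateMemory:
    """Toy memory whose state is a fixed d_k x d_v matrix (hypothetical example)."""

    def __init__(self, d_k: int, d_v: int) -> None:
        self.state: List[List[float]] = [[0.0] * d_v for _ in range(d_k)]

    def write(self, key: List[float], value: List[float]) -> None:
        # Accumulate the outer product key * value^T into the state.
        # Cost is d_k * d_v, independent of how many writes came before.
        for i, k in enumerate(key):
            row = self.state[i]
            for j, v in enumerate(value):
                row[j] += k * v

    def read(self, query: List[float]) -> List[float]:
        # Return query^T * state; again d_k * d_v work regardless of history.
        d_v = len(self.state[0])
        return [
            sum(q * self.state[i][j] for i, q in enumerate(query))
            for j in range(d_v)
        ]


memory = ConstantStateMemory(d_k=2, d_v=2)
memory.write([1.0, 0.0], [3.0, 4.0])
memory.write([0.0, 1.0], [5.0, 6.0])
print(memory.read([1.0, 0.0]))
```

In this toy, recall is exact only while the stored keys are orthogonal; once more items are written than the key dimension allows, reads start to blend values together.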


zozbot234 87 days ago

There are already models with capped long context, but if you make that the whole model it makes needle-in-haystack search impossible, and that's actually a very common operation. Which is why Qwen 3.5 only makes a portion of it capped, and AIUI the new Nemotron models are broadly similar.
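
A rough sketch of what capping only a portion of the layers buys in memory terms. It models the capped layers as a sliding window over the most recent tokens, which is an assumption for illustration and not a description of how Qwen 3.5 or Nemotron work internally. The function names and every model dimension below are hypothetical.

```python
from typing import Optional


def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   window: Optional[int] = None) -> int:
    """KV-cache size for a stack of attention layers.

    If `window` is set, each layer keeps at most `window` tokens (capped context).
    The factor 2 accounts for storing both keys and values.
    """
    cached_tokens = context_len if window is None else min(context_len, window)
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_elem


def hybrid_kv_cache_bytes(context_len: int, full_layers: int, capped_layers: int,
                          window: int, n_kv_heads: int, head_dim: int,
                          bytes_per_elem: int = 2) -> int:
    """Total cache when only some layers attend over the whole context."""
    full = kv_cache_bytes(context_len, full_layers, n_kv_heads, head_dim,
                          bytes_per_elem)
    capped = kv_cache_bytes(context_len, capped_layers, n_kv_heads, head_dim,
                            bytes_per_elem, window=window)
    return full + capped


# Hypothetical 48-layer model, 8 KV heads of dimension 128, 16-bit cache entries.
CONTEXT = 262_144
all_full = kv_cache_bytes(CONTEXT, 48, 8, 128)
hybrid = hybrid_kv_cache_bytes(CONTEXT, full_layers=12, capped_layers=36,
                               window=4096, n_kv_heads=8, head_dim=128)
print(f"all layers full attention: {all_full / 2**30:.1f} GiB")
print(f"12 full + 36 capped:       {hybrid / 2**30:.1f} GiB")
```

The full-attention layers that remain are what keep exact lookups over the whole context possible; the capped layers are where the memory saving comes from.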


arw0n 87 days ago

See also the new DeepSeek paper on engram transformers for some progress in this area: https://arxiv.org/pdf/2601.07372v1

They observe significant gains in factual knowledge retrieval capabilities, but reasoning barely moves the needle.
