| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by zozbot234 60 days ago
	Ultimately it would amount to lazy-loading the model, but the parameters themselves would be fetched from the network as needed, which still decreases time-to-first-token. It's true that "expert" choices will span most of the model, regardless of any particular "subject" or "topic" choice, but if we simply care about time-to-first-token it's still a viable strategy.

1 comments

bavell 60 days ago

Perhaps you could generate a few tokens before the entire model is downloaded, but since every token takes a potentially different "path" through an MoE model, you'd still need to wait for the entire download before getting deeper than a handful of tokens... which is not really a UX improvement imo.

link

zozbot234 60 days ago

Even at its worst, it's a minor UX improvement compared to having to download everything prior to getting to the first token. Ultimately we will complete the download, but we can still pick the best priority so that the first handful of tokens goes through.

link