|
|
|
|
|
by zozbot234
60 days ago
|
|
Ultimately it would amount to lazy-loading the model, but the parameters themselves would be fetched from the network as needed, which still decreases time-to-first-token. It's true that "expert" choices will span most of the model, regardless of any particular "subject" or "topic" choice, but if we simply care about time-to-first-token it's still a viable strategy. |
|