HN Mirror



	by kippinitreal 1069 days ago
	Clever idea. I think you would have to recompute the context (i.e. re-embed the prior tokens with the new model) every time you swapped models, because each model has different weights, so the cached state from one isn't valid for the other. Going from big->small might make this overhead worth it, but going back from small->big would assuredly be very costly.
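
	To put rough numbers on that asymmetry, here's a back-of-envelope sketch. It assumes the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token and ignores the attention term that grows with context length. The model sizes (7B and 70B) and the 2000-token context are made-up values for illustration, not anything from the original proposal.

```python
# Back-of-envelope cost of re-processing the context after a model swap.
# Assumption: forward pass ~= 2 FLOPs per parameter per token
# (ignores the attention cost that grows with context length).

SMALL_PARAMS = 7e9    # hypothetical small model
BIG_PARAMS = 70e9     # hypothetical big model


def prefill_flops(n_params: float, n_context_tokens: int) -> float:
    """Approximate FLOPs to run the prior tokens through a model once."""
    return 2.0 * n_params * n_context_tokens


def swap_cost(target_params: float, n_context_tokens: int) -> float:
    """After a swap, the *target* model has to re-process the whole context."""
    return prefill_flops(target_params, n_context_tokens)


context_len = 2000  # hypothetical number of prior tokens

big_to_small = swap_cost(SMALL_PARAMS, context_len)  # small model re-reads context
small_to_big = swap_cost(BIG_PARAMS, context_len)    # big model re-reads context
ratio = small_to_big / big_to_small

print(f"big->small recompute: {big_to_small:.2e} FLOPs")
print(f"small->big recompute: {small_to_big:.2e} FLOPs")
print(f"small->big costs {ratio:.0f}x more than big->small")
```

	Under these assumptions the recompute cost scales with the size of the model you're switching to, which is why the small->big direction is the painful one.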