| HN Mirror

	by alfiedotwtf 13 hours ago
	I'm pretty sure you could do n-expert capping on any MoE model with only a handful of lines changed in ik_llama.cpp, but yeah... my bet is they have various quantisations and run the lower ones at peak (along with different system prompts, e.g. "we're GPU-bound right now. Get to the point with less chatter").
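
	To make the "n-expert capping" idea concrete, here is a rough sketch of a top-k MoE router with an optional cap on how many experts run per token. It is plain Python for illustration only, not ik_llama.cpp code; the names `route_token`, `top_k` and `max_experts`, and the choice to renormalise the weights over the surviving experts, are all assumptions.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k, max_experts=None):
    """Pick experts for one token (hypothetical helper).

    top_k is the number of experts the model was trained to use;
    max_experts is an optional serving-time cap (assumed knob).
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    k = top_k if max_experts is None else min(top_k, max_experts)
    probs = softmax(router_logits)
    # Indices of the k experts with the highest router probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise so the kept experts' weights still sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


logits = [2.0, 1.0, 0.5, 3.0]
full = route_token(logits, top_k=4)                   # normal routing
capped = route_token(logits, top_k=4, max_experts=2)  # capped at peak load
```

	With the cap in place only the two highest-scoring experts run for that token, so per-token compute drops, presumably at some cost in output quality.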