| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by vessenes 89 days ago
	Thanks -- interesting. I like the idea of ablating layers. I guess you could get a differentiable stack that has a layer skip and layer copy/loop and a total memory use loss function; that would let someone ship either a big (usually ablate) or little (usually copy) model. The expert routing for longer sequences interests me a lot because the edge inference issue is always memory bandwidth.