HN Mirror

by gkapur 37 days ago

On the limitation side:

Do you think this would scale to larger transformer models with more parameters per layer?

How would this work with MoE (mixture-of-experts) models or other sparse models?