Hacker News
by gkapur 37 days ago
On the limitation side:

Do you think this would scale to larger transformer models with more parameters per layer?

How would this work with MoE (mixture-of-experts) models or other sparse models?