Do you think this would scale to larger transformer models with more parameters per layer?
How would this work with MoE (mixture-of-experts) models or other sparse models?