|
|
|
|
|
by gwern
101 days ago
|
|
Adding, swapping, or duplicating layers has a long history (eg. StyleGAN, upcycling), and it was pointed out at least as far back as He et al 2015 (Resnets) that you could ablate or add more layers because they functioned more as just doing some incremental compute iteratively, and many of them were optional. (Or consider Universal Transformers or heck, just how BPTT works.) So this idea is not far out of distribution, if at all, especially if you're a LLM who knows the literature and past approaches (which most humans would not because they only just got into this area post-ChatGPT). |
|
https://github.com/karpathy/autoresearch/blob/master/progres...
My opinion is you’d have to go pretty far down the x axis to get to anything that’s not things like tinkering with bs, lr, or positional encodings. There are so many hyperparameter knobs already exposed that duplicating layers is unlikely to be proposed for a long time.
I also just noticed that the last change it applied was changing the random seed. Lol.