HN Mirror



	by gwern 101 days ago
	Adding, swapping, or duplicating layers has a long history (e.g. StyleGAN, upcycling), and it was pointed out at least as far back as He et al 2015 (ResNets) that you could ablate or add more layers, because each layer functions more like one incremental step of iterative compute, and many of them were optional. (Or consider Universal Transformers or, heck, just how BPTT works.) So this idea is not far out of distribution, if at all, especially if you're an LLM who knows the literature and past approaches (which most humans would not, because they only just got into this area post-ChatGPT).
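To make the "incremental compute" point concrete, here is a minimal toy sketch, not anything from the ResNet paper or from Autoresearch. It assumes a stack of residual updates `x <- x + f(x)` with small random branches, then drops or repeats one layer and measures how far the output moves. All names (`make_block`, `forward`, `rel_change`) and the sizes and scales are hypothetical choices for illustration.

```python
import math
import random


def make_block(dim, scale, rng):
    """Toy residual branch f(x) = scale * tanh(W x) with a random matrix W."""
    w = [[rng.gauss(0.0, 1.0) / math.sqrt(dim) for _ in range(dim)]
         for _ in range(dim)]

    def branch(x):
        return [scale * math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
                for row in w]

    return branch


def forward(blocks, x):
    """Apply each block as an incremental update: x <- x + f(x)."""
    for f in blocks:
        x = [xi + di for xi, di in zip(x, f(x))]
    return x


def rel_change(a, b):
    """L2 distance between a and b, relative to the L2 norm of a."""
    num = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    den = math.sqrt(sum(ai ** 2 for ai in a))
    return num / den


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dim, depth = 8, 12      # arbitrary toy sizes (assumption)
blocks = [make_block(dim, 0.1, rng) for _ in range(depth)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]

base = forward(blocks, x0)
# Ablate layer 5: the remaining layers still compose without any other change.
ablated = forward(blocks[:5] + blocks[6:], x0)
# Duplicate layer 5: apply the same incremental step twice in a row.
duplicated = forward(blocks[:6] + [blocks[5]] + blocks[6:], x0)

print("relative change after ablation:   ", rel_change(base, ablated))
print("relative change after duplication:", rel_change(base, duplicated))
```

Because every layer only adds a correction to a running state of fixed shape, removing or repeating one leaves the rest of the stack well-defined. How much the output actually shifts in a trained network is an empirical question that this toy does not answer.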


janalsncm 101 days ago

I don’t disagree, but it’s worth having a look at the changes the LLM did apply.

https://github.com/karpathy/autoresearch/blob/master/progres...

My opinion is you’d have to go pretty far down the x axis to get to anything that’s not things like tinkering with bs, lr, or positional encodings. There are so many hyperparameter knobs already exposed that duplicating layers is unlikely to be proposed for a long time.

I also just noticed that the last change it applied was changing the random seed. Lol.


gwern 101 days ago

My understanding was that Autoresearch was defined as training from scratch (since it's based on the nanoGPT speedrun), not using any pretrained models. So it couldn't do anything like upcycling a pretrained model or the Frankenmerge, because it's not given any access to such a thing in the first place. (If it could, the speedrun would be pointless, as it would mostly benchmark which fileserver lets you download a highly compressed pretrained model checkpoint the fastest...) It can increase the number of layers for a new architecture+run, but that's not the same thing.
