Whats better: Neural nets wider with less layers or thinner with more layers


	Whats better: Neural nets wider with less layers or thinner with more layers (vatsadev.github.io)
	32 points by vatsadev 719 days ago

3 comments

Grimblewald 718 days ago

It makes sense that lessons learned from working with dense networks apply to transformers as well, since at their core these are still just dense networks.

The way I grew to understand the relationship, and I am happy to discuss this / receive feedback, is that a layer's width determines how much that layer can memorize while network depth determines the complexity of abstraction possible for the network to learn.

So a wide enough layer can simply remember everything while a deep enough network will be able to, through abstraction, recreate memories of everything using a simplification of the input.

Ideally, you want a balance of the two, since you don't want to rely on memory alone, as this doesn't tend to generalize well, nor do you want to deal with the fantasy outputs from something relying too heavily on abstraction, as this is not likely to be reliable.


MattPalmer1086 717 days ago

That makes a lot of sense, thanks for the explanation.


supple-mints 717 days ago

Is it harder to train the wider network or the deeper network, all else being equal?


vatsadev 717 days ago

Post author here. If you look at MFU (model FLOPs utilization), the wider layers win out, and init takes much longer the more layers you add.
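For anyone unfamiliar with MFU, here is a rough sketch of how it is usually estimated. It assumes the common approximation of about 6 FLOPs per parameter per token for a training step (forward plus backward) and ignores the attention-score term. The function name and the example numbers are hypothetical, not taken from the post.

```python
def mfu(n_params: int, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
    """Approximate model FLOPs utilization during training.

    Assumes ~6 FLOPs per parameter per token (forward + backward)
    and ignores the extra attention-score FLOPs.
    """
    achieved_flops_per_sec = 6 * n_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec


if __name__ == "__main__":
    # Hypothetical run: a 10M-parameter model at 500k tokens/s on hardware
    # with an assumed peak of 312 TFLOP/s.
    print(f"MFU: {mfu(10_000_000, 500_000, 312e12):.1%}")
```

Wider layers mean larger matrix multiplies per kernel launch, which is one plausible reason they would score better on this metric.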


esafak 717 days ago

Classic paper: Wide & Deep Learning for Recommender Systems.

https://paperswithcode.com/method/wide-deep


chessgecko 717 days ago

*edit: never mind the below; this is a character-level model that probably has a small vocab, so it wouldn’t make a massive difference

Is this taking into account the parameters in the embedding and the output ffn? Normally, when models are really small and the vocab is large, those can account for an extremely large share of the parameters, which would explain why the optimal number of layers here is unusually small.

I suspect it isn’t being taken into account, because doubling the embedding and cutting the number of layers in half only holds the parameters constant if you forget the embedding and output, but I'd need to see more details on the config (mainly the vocab size he used) to confirm.
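To put rough numbers on the embedding question, here is a minimal sketch of a GPT-style parameter count. It assumes the usual approximation of 12 * d_model^2 parameters per block (4 * d^2 for attention, 8 * d^2 for an MLP with a 4x hidden size) and ignores biases, LayerNorm, and positional embeddings. The function name and the 4-layer, 128-wide config are hypothetical; 50257 is the GPT-2 vocab size, used only as a contrast to the roughly 56-character vocab mentioned in this thread.

```python
def transformer_params(n_layer: int, d_model: int, vocab_size: int, tied: bool = True) -> dict:
    """Rough GPT-style parameter count (no biases, LayerNorm, or positional embeddings)."""
    # Per block: 4 * d^2 for attention (Q, K, V, output projection)
    # plus 8 * d^2 for an MLP with a 4x hidden size.
    blocks = 12 * n_layer * d_model ** 2
    # Token embedding, plus a separate output projection when weights are untied.
    embed = vocab_size * d_model * (1 if tied else 2)
    return {
        "blocks": blocks,
        "embed": embed,
        "embed_share": embed / (blocks + embed),
    }


if __name__ == "__main__":
    # Hypothetical small config, compared across two vocab sizes.
    for vocab in (56, 50257):
        p = transformer_params(n_layer=4, d_model=128, vocab_size=vocab, tied=False)
        print(f"vocab={vocab}: blocks={p['blocks']:,} embed={p['embed']:,} "
              f"share={p['embed_share']:.1%}")
```

With a character-level vocab the embedding share is tiny, while a large subword vocab would dominate a model this small. Also note that under this approximation block parameters grow with the square of d_model, so whether a given width/depth trade holds the total constant depends on how the configs were actually built.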


vatsadev 717 days ago

Vocab size is like 56, so it's not doing much.
