by Grimblewald
722 days ago
It makes sense that lessons learned from working with dense networks apply to transformers as well, since these are at their core still just dense networks. The way I grew to understand the relationship (and I am happy to discuss this or receive feedback) is that a layer's width determines how much that layer can memorize, while network depth determines how complex an abstraction the network can learn. So a wide enough layer can simply remember everything, while a deep enough network can, through abstraction, recreate memories of everything from a simplified representation of the input. Ideally you want a balance of the two: relying on memory alone tends not to generalize well, and relying too heavily on abstraction produces fantasy outputs that are not likely to be reliable.
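To make the width/depth trade-off a bit more concrete, here is a rough sketch that just counts parameters for two plain fully connected networks with a similar budget, one wide and shallow, one narrow and deep. The layer sizes (64 inputs, 10 outputs, 1024 vs. 5 x 128 hidden units) and the helper name `mlp_param_count` are made up for illustration; this only shows how the same budget can be spent either way, not how either network would actually generalize.

```python
def mlp_param_count(n_in, hidden_widths, n_out):
    """Count weights and biases in a plain fully connected network."""
    sizes = [n_in, *hidden_widths, n_out]
    # Each layer contributes (inputs * outputs) weights plus one bias per output.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


# Hypothetical sizes, chosen only so the two budgets come out roughly equal.
wide_shallow = mlp_param_count(64, [1024], 10)    # one very wide hidden layer
narrow_deep = mlp_param_count(64, [128] * 5, 10)  # five narrow hidden layers

print(f"wide/shallow: {wide_shallow} parameters")
print(f"narrow/deep:  {narrow_deep} parameters")
```

Both come out at roughly 76k parameters, so the choice between "more memory per layer" and "more layers of abstraction" is a question of how the budget is arranged, not how large it is.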