by oofbey
1574 days ago
Oh yeah, I think you're right. If it's all fully connected (and parts of a transformer are), then more, thinner layers use fewer FLOPs than fewer, thicker layers for the same number of parameters. As long as the layers are wide enough to keep the GPU busy, it'll run faster.
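
A rough way to sanity-check a claim like this is to count parameters and matmul FLOPs for both shapes. The sketch below is only an illustration under simplifying assumptions that are not from the thread: square, bias-free dense layers, 2 FLOPs per multiply-add, a single input vector, and made-up widths and depths. The `dense_stack_cost` helper is hypothetical.

```python
def dense_stack_cost(width: int, depth: int) -> tuple[int, int]:
    """Return (parameters, matmul FLOPs per input vector) for `depth`
    square, bias-free dense layers of size width x width.

    Assumption: activations, attention, and memory traffic are ignored.
    """
    params = depth * width * width
    # One multiply and one add per weight, per input vector.
    flops = 2 * params
    return params, flops


# Two hypothetical shapes chosen to have the same parameter budget.
thick = dense_stack_cost(width=2048, depth=4)   # fewer, thicker layers
thin = dense_stack_cost(width=1024, depth=16)   # more, thinner layers

print("thick (params, flops):", thick)
print("thin  (params, flops):", thin)
```

Under these assumptions the dense matmul FLOPs per input come out as exactly twice the parameter count for both shapes, so any difference in practice would have to come from terms this sketch leaves out (attention, activations, memory traffic) or from how well each shape keeps the GPU busy.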