| HN Mirror


	by oofbey 1612 days ago
	Why would evaluation be faster with narrower layers? If there were fewer tokens it would definitely be faster, because attention in transformers scales with tokens^2, but here "narrow" means the number of channels, presumably for the same number of tokens.


chabons 1612 days ago

If I remember correctly the fully connected layers after the attention block are [?, a*h] * [a*h, b*h] (for some scalars a,b and hidden size h), which means that transformers also scale with h^2. I don't know what fraction of the total FLOPs that section of the model takes for practical model sizes, but it would indicate that making the model narrower for the same number of params would reduce compute.
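
As a rough sanity check on the h^2 scaling, here is a minimal Python sketch that counts FLOPs and weights for the [?, a*h] * [a*h, b*h] multiply described above. The function names, the defaults a=1 and b=4, and the convention of counting one multiply-add as 2 FLOPs are assumptions for illustration, not something stated in the thread.

```python
def matmul_flops(n_tokens: int, h: int, a: int = 1, b: int = 4) -> int:
    """FLOPs for [n_tokens, a*h] @ [a*h, b*h], counting a multiply-add as 2 FLOPs."""
    return 2 * n_tokens * (a * h) * (b * h)


def matmul_params(h: int, a: int = 1, b: int = 4) -> int:
    """Number of weights in the [a*h, b*h] matrix (bias ignored)."""
    return (a * h) * (b * h)


if __name__ == "__main__":
    # Hypothetical hidden sizes, one token each.
    for h in (512, 1024, 2048):
        print(f"h={h}: params={matmul_params(h)}, flops_per_token={matmul_flops(1, h)}")
```

Doubling h quadruples both the FLOPs and the weight count of this one multiply, which is the h^2 scaling in question.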

oofbey 1611 days ago

Oh yeah, I think you're right. If it's all fully-connected (and parts of a transformer are) then more thinner layers use fewer FLOPs than fewer thicker layers, for the same number of parameters. As long as the layers are wide enough to keep the GPU busy it'll run faster.
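
To make the comparison concrete, a back-of-the-envelope sketch in Python can compare two stacks of square fully-connected layers with the same parameter count. The layer sizes are hypothetical, biases and attention are ignored, and a multiply-add is counted as 2 FLOPs, so treat it as a sanity check under those assumptions rather than a benchmark.

```python
def stack_params(n_layers: int, width: int) -> int:
    """Weights in n_layers square [width, width] dense layers (bias ignored)."""
    return n_layers * width * width


def stack_flops_per_token(n_layers: int, width: int) -> int:
    """FLOPs to push one token through the stack, 2 FLOPs per multiply-add."""
    return 2 * n_layers * width * width


if __name__ == "__main__":
    # Hypothetical shapes chosen so both stacks have the same parameter count.
    configs = {"deep_narrow": (16, 512), "shallow_wide": (4, 1024)}
    for name, (layers, width) in configs.items():
        print(
            f"{name}: layers={layers}, width={width}, "
            f"params={stack_params(layers, width)}, "
            f"flops_per_token={stack_flops_per_token(layers, width)}"
        )
```

Under this simplified count both stacks have the same parameters and the same FLOPs per token, so any speed difference between them would have to come from terms not modelled here, such as attention cost or how well each shape keeps the GPU busy.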
