HN Mirror



	by mp187 777 days ago
	Why was this your first thought? Is the MLP layer a limiting factor in transformers? I thought the bottleneck was in the renormalization part.

1 comment

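At small input sizes, yes, the MLP dominates compute. At large input sizes attention matters more, since its cost grows quadratically with sequence length while the MLP's grows only linearly.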
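
To put rough numbers on that, here is a back-of-the-envelope sketch of per-layer FLOPs. It assumes a standard transformer block with a 4x MLP expansion, counts 2 FLOPs per multiply-add, and ignores softmax, layer norm, biases, causal masking, and memory bandwidth. The function names and the example width of 768 are just illustrative, not taken from any particular model or library.

```python
def mlp_flops(n: int, d: int, expansion: int = 4) -> int:
    """FLOPs for one MLP block on n tokens of width d.

    Two matmuls: (n x d)(d x e*d) and (n x e*d)(e*d x d),
    each costing 2 * n * d * e * d FLOPs.
    """
    return 2 * (2 * n * d * expansion * d)


def attention_flops(n: int, d: int) -> int:
    """FLOPs for one self-attention block on n tokens of width d.

    Summed over all heads, so the head count cancels out.
    """
    projections = 4 * (2 * n * d * d)  # Q, K, V and output projections
    scores = 2 * n * n * d             # Q @ K^T
    mixing = 2 * n * n * d             # attention weights @ V
    return projections + scores + mixing


def crossover_length(d: int, expansion: int = 4) -> int:
    """Sequence length at which attention and MLP FLOPs are equal.

    Solving 8*n*d^2 + 4*n^2*d = 4*e*n*d^2 for n gives n = (e - 2) * d.
    """
    return (expansion - 2) * d


if __name__ == "__main__":
    d = 768  # illustrative model width (assumption)
    for n in (128, crossover_length(d), 16384):
        ratio = attention_flops(n, d) / mlp_flops(n, d)
        print(f"n={n:6d}  attention/MLP FLOP ratio = {ratio:.2f}")
```

Under these assumptions the two are equal at a sequence length of about twice the model width. Below that the MLP is the larger share, above it attention is. Real kernels shift the practical crossover, so treat this as an estimate rather than a measurement.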