Hacker News
by
mp187
777 days ago
Why was this your first thought? Is the MLP layer a limiting factor for transformers? I thought the bottleneck was in the renormalization part.
1 comment
brrrrrm
777 days ago
At small input sizes, yes, the MLP dominates compute. At large input sizes, attention matters more.
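To put rough numbers on the reply above, here is a back-of-the-envelope FLOP count for one transformer layer. It is only a sketch under stated assumptions: a plain dense layer, an MLP with a 4x hidden expansion and no gating, 2 FLOPs per multiply-add, and softmax, layer norm, and biases ignored. The function names and the choice of `d = 1024` are made up for illustration and do not come from the thread.

```python
def mlp_flops(n: int, d: int, expansion: int = 4) -> int:
    """Forward FLOPs for the MLP block on n tokens of width d (assumed 4x expansion)."""
    # Two matmuls: (n, d) x (d, 4d) and (n, 4d) x (4d, d), 2 FLOPs per multiply-add.
    return 2 * 2 * n * d * (expansion * d)


def attention_flops(n: int, d: int) -> int:
    """Forward FLOPs for self-attention on n tokens of width d (softmax ignored)."""
    projections = 4 * 2 * n * d * d  # Q, K, V, and output projections: linear in n
    scores = 2 * n * n * d           # Q @ K^T: quadratic in n
    mix = 2 * n * n * d              # attention weights @ V: quadratic in n
    return projections + scores + mix


if __name__ == "__main__":
    d = 1024  # hypothetical model width
    for n in (128, 2048, 8192):
        ratio = attention_flops(n, d) / mlp_flops(n, d)
        print(f"n={n:5d}  attention/MLP FLOP ratio = {ratio:.2f}")
```

Under these assumptions the MLP costs 16·n·d² and attention costs 8·n·d² + 4·n²·d, so the ratio is 0.5 + n/(4d): the MLP dominates while n is below about 2d, and the quadratic attention term takes over beyond that. Real implementations (gated MLPs, grouped-query attention, fused kernels) shift the crossover point, so treat the exact threshold as illustrative.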