by mp187 777 days ago
Why was this your first thought? Is the MLP layer a limiting factor for transformers? I thought the bottleneck was in the renormalization part.

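At small input sizes (short sequences), yes, the MLP dominates compute. At large input sizes attention matters more, because the attention score computation grows quadratically with sequence length while the MLP grows only linearly.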
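
A rough back-of-the-envelope sketch of that tradeoff, in case it helps. It assumes a standard dense transformer layer with a 4x MLP expansion and counts only matmul FLOPs (softmax, layer norm, and biases are ignored). The function name `layer_flops` and its parameters are made up for illustration, not taken from any library.

```python
def layer_flops(n, d, mlp_ratio=4):
    """Approximate forward-pass matmul FLOPs for one transformer layer.

    n: sequence length, d: model width (both hypothetical parameters).
    A multiply-add is counted as 2 FLOPs.
    """
    # MLP: two matmuls, d -> mlp_ratio*d and back again.
    mlp = 2 * 2 * n * d * (mlp_ratio * d)
    # Attention projections: Q, K, V, and output, each d -> d.
    attn_proj = 4 * 2 * n * d * d
    # Attention scores: Q @ K^T and weights @ V, each costing n*n*d multiply-adds.
    attn_scores = 2 * 2 * n * n * d
    return {"mlp": mlp, "attn_proj": attn_proj, "attn_scores": attn_scores}


if __name__ == "__main__":
    d = 1024
    for n in (512, 4096, 16384):
        f = layer_flops(n, d)
        ratio = f["attn_scores"] / f["mlp"]
        print(f"n={n:6d}  attention-scores / MLP = {ratio:.2f}")
```

Under these assumptions the MLP costs about 16·n·d² FLOPs and the score computation about 4·n²·d, so the quadratic term only overtakes the MLP once n is larger than roughly 4d.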