Hacker News new | ask | show | jobs
by yldedly 1085 days ago
What do you mean? Attention is just matrix multiplication and softmax, it's all feed forward.
1 comments

I wasn't disputing that it's feed-forward. I just meant that stacked transformer layers can be thought of as an iterative refinement of the intermediate activations. Not the same as an autoregressive process that receives previous outputs as inputs, but far more expressive than a single transformer layer.