Hacker News
by bjornsing 771 days ago
Wouldn’t the softmax typically be “fused” with the matmul though?
1 comments

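Yes, but as far as I understand this is only really practical with FlashAttention. (The main idea is that a numerically stable softmax needs the log-sum-exp trick, i.e. subtracting the max activation before exponentiating. When you process the row block by block you don't know the final max until you've seen everything, so you keep a running max and rescale the partial results accumulated so far whenever it changes.)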
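To make the rescaling concrete, here is a rough pure-Python sketch of the idea for a single query row. It is not FlashAttention's actual kernel: the function names, the `block_size` parameter, and the use of scalar values (real attention uses value vectors) are all assumptions made for illustration.

```python
import math

def attention_row_naive(scores, values):
    """Reference: stable softmax over the full row, then a weighted sum."""
    m = max(scores)  # needs the whole row up front
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_row_streaming(scores, values, block_size=2):
    """Same result, but scores are consumed block by block (hypothetical helper).

    Assumes finite scores and scalar values.
    """
    running_max = -math.inf
    denom = 0.0  # sum of exp(score - running_max) over blocks seen so far
    acc = 0.0    # sum of exp(score - running_max) * value over blocks seen so far
    for start in range(0, len(scores), block_size):
        block_s = scores[start:start + block_size]
        block_v = values[start:start + block_size]
        new_max = max(running_max, max(block_s))
        # The earlier partial sums were taken relative to the old max,
        # so rescale them to the new one (the factor is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(block_s, block_v):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom
```

The point is that the full row of softmax weights is never stored: only `running_max`, `denom`, and `acc` are carried between blocks, which is what lets the softmax be folded into the surrounding matmuls.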