Overwhelming majority of flops is indeed spent on matmuls, but softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from just looking at flops.
Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!
Next question: does the softmax in the SA block cause it to be bandwidth bound—won’t it have to materialize all the parameters of the N^2 matrix either way? Does SM cause redundant data reads?
Yes but as far as I understand this is only really usefully possible with FlashAttention. (The main idea is that you have to use the log-sum-exp trick when computing the softmax, but can't compute the max activation incrementally so have to rescale everything.)