
	by terafo 804 days ago
	The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax uses a disproportionate share of memory bandwidth, so it generally takes much longer than you'd expect from the FLOP count alone.

2 comments

tehsauce 804 days ago

If CPU softmax were limited by memory bandwidth, then these vectorization optimizations wouldn't improve performance.

cgearhart 804 days ago

Why does it disproportionately use bandwidth?

jacobn 804 days ago

In transformers the attention matrix is N*N, so there are a lot of values to go over, and softmax does very little arithmetic on each one. That typically makes it memory bandwidth bound, not compute bound.
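
To put rough numbers on that, here is a back-of-the-envelope sketch comparing FLOPs per byte moved for the QK^T matmul and for an unfused softmax over the N*N scores. The function name, the figure of 5 FLOPs per softmax element, and the single read plus single write of the score matrix are all assumptions made for illustration; real kernels differ.

```python
def attention_intensity(n, d, bytes_per_value=2):
    """Rough FLOPs-per-byte for QK^T and for an unfused softmax (illustrative only)."""
    # QK^T: (n x d) @ (d x n) costs about 2*n*n*d FLOPs.
    # It reads Q and K (n*d values each) and writes the n x n scores.
    matmul_flops = 2 * n * n * d
    matmul_bytes = (2 * n * d + n * n) * bytes_per_value

    # Unfused softmax: ASSUMED ~5 FLOPs per element (max, subtract, exp, sum, divide),
    # with the n x n scores read once and written once.
    softmax_flops = 5 * n * n
    softmax_bytes = 2 * n * n * bytes_per_value

    return matmul_flops / matmul_bytes, softmax_flops / softmax_bytes


matmul_ratio, softmax_ratio = attention_intensity(n=4096, d=64)
print(f"matmul:  {matmul_ratio:.1f} FLOPs/byte")
print(f"softmax: {softmax_ratio:.2f} FLOPs/byte")
```

Under these assumptions the softmax does on the order of one FLOP per byte moved, while the matmul does tens, which is why the softmax step tends to be limited by memory traffic rather than arithmetic.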

cgearhart 804 days ago

Oooooh, I forgot that the self-attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!

Next question: does the softmax in the self-attention block cause it to be bandwidth bound? Won't it have to materialize all N^2 entries of the matrix either way? Does the softmax cause redundant data reads?

bjornsing 804 days ago

Wouldn’t the softmax typically be “fused” with the matmul though?

anewhnaccount2 804 days ago

Yes, but as far as I understand this is only really useful with FlashAttention. (The main idea is that a numerically stable softmax needs the row max, as in the log-sum-exp trick, but the max isn't known until the whole row has been seen, so you keep a running max and rescale the partial results whenever it changes.)
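
A minimal sketch of that rescaling idea, often called online softmax, in plain Python. It processes one row of scores block by block, keeping a running max and a running sum of exponentials, and rescales the sum whenever a larger max shows up. The function names and block size are hypothetical, and this only covers the softmax normalizer; FlashAttention applies the same kind of rescaling to its partial output as well, inside a fused GPU kernel.

```python
import math


def online_softmax(xs, block_size=4):
    """Softmax over xs, computing the max and normalizer in one blockwise pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(xs), block_size):
        block = xs[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the sum accumulated so far so it is expressed relative to new_max,
        # then add this block's contribution.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in block)
        running_max = new_max
    # Final normalization pass using the global max and sum.
    return [math.exp(x - running_max) / running_sum for x in xs]


def two_pass_softmax(xs):
    """Reference: find the max first, then exponentiate and normalize."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 3.0, -2.0, 0.5, 1000.0, 999.0, 2.0]
blockwise = online_softmax(scores, block_size=3)
reference = two_pass_softmax(scores)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(blockwise, reference)))
```

The point of the single pass is that the scores never need to be stored in full just to find the max, which is what lets the softmax be folded into the surrounding matmuls.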
