| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by menaerus 564 days ago

> The only thing which is larger is the self attention calculation which is quadratic wrt compute and linear wrt memory if you use FlashAttention or similar fused self attention calculations.

FFWD input is self-attention output. And since the output of self-attention layer is [context, d_model], FFWD layer input will grow as well. Consequently, FFWD layer compute cost will grow as well, no?

The cost of FFWD layer according to my calculations is ~(4+2 * true(w3)) * d_model * dff * n_layers * context_size so the FFWD cost grows linearly wrt the context size.

So, unless I misunderstood the transformer architecture, larger the context the larger the compute of both self-attention and FFWD is?

1 comments

Kubuxu 564 days ago

FFWD later is independent of context size, each processed token passes thought the same weights.

link

menaerus 564 days ago

So you're saying that if I have a sentence of 10 words, and I want the LLM to predict the 11th word, FFWD compute is going to be independent of the context size?

I don't understand how since that very context is what makes the likeliness of output of next prediction worthy, or not?

More specifically, FFWD layer is essentially self attention output [context, d_model] matrix matmul'd with W1, W2 and W3 weights?

link