by menaerus
554 days ago
So you're saying that if I have a 10-word sentence and I want the LLM to predict the 11th word, the FFWD compute is going to be independent of the context size? I don't understand how, since that very context is what determines whether the next-word prediction is any good. More specifically, isn't the FFWD layer essentially the self-attention output, a [context, d_model] matrix, matmul'd with the W1, W2 and W3 weights?
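To make the shapes in the question concrete, here is a minimal pure-Python sketch of that computation. It assumes a SwiGLU-style layer where W1 is the gate projection, W3 the up projection, and W2 the down projection; that role assignment, the tiny dimensions, and the helper names (`ffn_row`, `ffn`, `ffn_mults_per_token`) are illustrative assumptions, not something stated in the thread.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (shape [out][in]) by vector x (length in)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def ffn_row(x, W1, W2, W3):
    """Assumed SwiGLU-style FFN for ONE token: W2 @ (silu(W1 @ x) * (W3 @ x))."""
    gate = [silu(g) for g in matvec(W1, x)]
    up = matvec(W3, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W2, hidden)


def ffn(X, W1, W2, W3):
    """Apply the FFN to a [context, d_model] matrix, one row at a time."""
    return [ffn_row(x, W1, W2, W3) for x in X]


def ffn_mults_per_token(d_model, d_ff):
    """Multiplications in the three projections for a single token."""
    return 3 * d_model * d_ff


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_ff, context = 4, 8, 10  # toy sizes, chosen only for illustration

W1 = rand_matrix(d_ff, d_model)   # gate projection (assumed role)
W3 = rand_matrix(d_ff, d_model)   # up projection (assumed role)
W2 = rand_matrix(d_model, d_ff)   # down projection (assumed role)
X = rand_matrix(context, d_model)  # stand-in for the self-attention output

full = ffn(X, W1, W2, W3)                # FFN over all 10 rows
last_only = ffn_row(X[-1], W1, W2, W3)   # FFN over the newest row only

# The last row's output is the same either way: rows never interact here.
assert full[-1] == last_only
```

In this sketch the rows of X never interact inside the FFN, so the cost per row is `ffn_mults_per_token(d_model, d_ff)` regardless of `context`, while running all rows scales linearly with it. Any dependence on the other tokens has to already be present in X, that is, it comes from the attention step that produced it.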