by sasjaws
111 days ago
No blog post. My LLM-expert friend told me this was kind of obvious when I shared it with him, so I didn't think it was worth writing up. I can tell you how I got there: I did nanoGPT, then tried to be smart and train a model with a loss function that targets the next 2 tokens instead of one. Calculate the loss function and you'll see it's exactly the same during training. A sibling commenter also mentions:

> the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b) and then with cross-entropy loss which optimizes for log likelihood this becomes a summation.

Hope that helps.
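For anyone who wants to check the arithmetic, here is a minimal sketch of that argument on a toy model. Everything in it is made up for illustration: `cond_prob` is a hypothetical stand-in for a trained model's conditional distribution, not anything from nanoGPT, and `two_token_loss` is just one reading of "targets 2 next tokens" (the joint probability of the next two tokens under teacher forcing).

```python
import math

def cond_prob(token, prefix):
    """P(token | prefix) for a hypothetical toy model over the vocabulary {0, 1}."""
    # Assumption: the chance of a 1 grows with the number of 1s already seen.
    p_one = (1 + sum(prefix)) / (2 + len(prefix))
    return p_one if token == 1 else 1.0 - p_one

def next_token_losses(seq):
    """Per-position cross-entropy, -log P(x_t | x_<t), with teacher forcing."""
    return [-math.log(cond_prob(seq[t], seq[:t])) for t in range(len(seq))]

def joint_prob(seq):
    """P(x_0, ..., x_n) via the chain rule: a product of conditionals."""
    p = 1.0
    for t in range(len(seq)):
        p *= cond_prob(seq[t], seq[:t])
    return p

def two_token_loss(seq, t):
    """-log P(x_t, x_{t+1} | x_<t): the joint over the next two tokens."""
    p_pair = cond_prob(seq[t], seq[:t]) * cond_prob(seq[t + 1], seq[:t + 1])
    return -math.log(p_pair)

seq = [1, 0, 1, 1]
losses = next_token_losses(seq)

# The negative log of the joint equals the sum of per-token losses.
nll = -math.log(joint_prob(seq))
print(abs(nll - sum(losses)) < 1e-12)

# A two-token target splits into the same two single-token terms.
pair = two_token_loss(seq, 1)
print(abs(pair - (losses[1] + losses[2])) < 1e-12)
```

Under this reading, summing the two-token loss over positions just counts most single-token terms twice, so it pushes the model toward the same optimum as the ordinary next-token loss. A variant that predicts the second token without conditioning on the first would be a different objective, and this sketch does not cover it.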