Hacker News new | ask | show | jobs
by sasjaws 111 days ago
No blog post, my llm expert friend told me this was kinda obvious when i shared it with him so i didnt think it was worth it.

I can tell you how i got there, i did nanogpt, then tried to be smart and train a model with a loss function that targets 2 next tokens instead of one. Calculate the loss function and you'll see its exactly the same during training.

Sibling commenter also mentions:

> the joint probability of a token sequence can be broken down autogressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b) and then with cross-entropy loss which optimizes for log likelihood this becomes a summation."

Hope that helps.