Hacker News
by csmpltn 907 days ago
> "I suppose it may help improve overall performance, you're still just blindly predicting the next word based on the previous one like a first order Markov chain would."

Instead of passing only the last "token" as context to the function that generates the next token, take the last 15 tokens (i.e. the last 2-3 sentences) and predict based on that. And that's your "attention" mechanism.
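To make the comparison concrete, here is a rough sketch of what the comment literally describes: a count-based predictor whose context is the last `order` tokens. With `order=1` it is the first-order Markov chain from the quote; a larger `order` is the "look further back" variant. Note that this is a plain fixed-window n-gram model, not the learned attention used in transformers. The toy corpus, the window of 4 (rather than 15), and the `train` / `predict` helpers are all made up for illustration.

```python
from collections import Counter, defaultdict


def train(tokens, order):
    """Count which token follows each context of `order` consecutive tokens."""
    counts = defaultdict(Counter)
    for i in range(order, len(tokens)):
        context = tuple(tokens[i - order:i])
        counts[context][tokens[i]] += 1
    return counts


def predict(counts, history, order):
    """Most frequent follower of the last `order` tokens, or None if unseen."""
    context = tuple(history[-order:])
    followers = counts.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, chosen so that "the" is ambiguous on its own.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

first_order = train(corpus, order=1)   # context = previous token only
fourth_order = train(corpus, order=4)  # context = previous four tokens

cat_history = "the cat sat on the".split()
dog_history = "the dog sat on the".split()

# With one token of context both histories collapse to ("the",),
# so the model cannot tell them apart and gives the same answer.
same_guess = predict(first_order, cat_history, 1) == predict(first_order, dog_history, 1)

# With four tokens of context the two histories are distinguishable.
cat_next = predict(fourth_order, cat_history, 4)  # "mat"
dog_next = predict(fourth_order, dog_history, 4)  # "rug"
```

The trade-off of a longer window is sparsity: any context that never appeared verbatim in the training text returns `None` here, which is one reason a raw lookup table stops scaling long before 15 tokens.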