
	by csmpltn 907 days ago
	> "I suppose it may help improve overall performance, you're still just blindly predicting the next word based on the previous one like a first order Markov chain would."

	Instead of taking only the last "token" as context for the function that generates the next token, take the last 15 tokens (i.e. the last 2-3 sentences) and predict based on that. And that's your "attention" mechanism.
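
	To make the difference concrete, here is a minimal sketch of the fixed-window idea described above, assuming a plain count-based lookup table. The names `train` and `predict`, the toy sentence, and the window size `k` are all made up for illustration. It is a higher-order Markov chain (an n-gram model), not an implementation of transformer attention.

```python
from collections import Counter, defaultdict


def train(tokens, k):
    """Count which token follows each window of k consecutive tokens."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - k):
        context = tuple(tokens[i:i + k])
        table[context][tokens[i + k]] += 1
    return table


def predict(table, context, k):
    """Return the most frequent successor of the last k tokens, or None if unseen."""
    counts = table.get(tuple(context[-k:]))
    return counts.most_common(1)[0][0] if counts else None


# Toy corpus (hypothetical): "the" is followed by four different words.
tokens = "the cat sat on the mat the dog sat on the rug".split()

first_order = train(tokens, k=1)   # context = previous token only
fourth_order = train(tokens, k=4)  # context = previous four tokens

# With one token of context, "the" is ambiguous.
print(dict(first_order[("the",)]))

# With four tokens of context, the earlier words disambiguate the prediction.
print(predict(fourth_order, ["the", "cat", "sat", "on", "the"], k=4))
print(predict(fourth_order, ["the", "dog", "sat", "on", "the"], k=4))
```

	With `k=1` the table only knows that several words can follow "the"; with `k=4` the earlier "cat" or "dog" decides between "mat" and "rug". The cost is that the table grows quickly with `k`, and any window not seen in training yields no prediction at all.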