HN Mirror



	by vlovich123 265 days ago
	That’s a very basic way to keep the LLM inferring past the context window size (there are better, smarter ways), but that’s not what the question was, which is how they train a 2M-token context window. My understanding, at a basic level, is that you need training documents that are >2M tokens long, and that’s where the problem comes in: there’s only so much long-form content, and it’s swamped by all the shorter stuff. I think there are probably tricks now, but I suspect it’s still largely an open problem.


Ey7NFZ3P0nzAe 265 days ago

AFAIK nobody does that. They train on much, much shorter text but use tricks in the position-encoding step that the LLM can extrapolate from, like RoPE, YaRN, etc.
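To make the position-encoding point concrete, here is a minimal pure-Python sketch of RoPE plus plain linear position interpolation. It is an illustration under stated assumptions, not any particular model's implementation: the function names, the `scale` parameter, the toy vectors, and the 4096/16384 lengths are all made up for the example. YaRN itself is more involved (it rescales frequencies non-uniformly) and is not reproduced here.

```python
import math

def rope(vec, pos, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE).

    `scale` < 1 squeezes positions into a shorter range; this is simple
    linear position interpolation, an assumption for illustration, not YaRN.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        # Pair (i, i+1) rotates at frequency base^(-i/d).
        theta = (pos * scale) * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The attention score depends only on the relative offset (3 in both cases),
# not on the absolute positions.
same_offset_a = dot(rope(q, 10), rope(k, 7))
same_offset_b = dot(rope(q, 1010), rope(k, 1007))
assert abs(same_offset_a - same_offset_b) < 1e-9

# Linear interpolation: a model trained on 4096 positions, run at 16384,
# scales positions by 4096/16384 so they stay inside the trained range.
train_len, target_len = 4096, 16384
scale = train_len / target_len
assert rope(q, 16000, scale=scale) == rope(q, 4000)
```

Because the score only sees relative offsets, a model trained on short text has already seen every short-range offset; scaling positions is one simple way to keep long inputs inside the range of angles seen during training.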


ErikBjare 264 days ago

AFAIK (not much), it definitely helps to train on longer sequences even with RoPE/YaRN, and it is needed if you care about long-context performance (and not just long-context capability).
