by deet 1040 days ago
It's possible to extend the effective context window of many OSS models using various techniques. For the Llama-related models and others, there's a technique called "RoPE scaling" which lets you run inference over a longer context window than the model was originally trained for. (This reddit post helped highlight this fact: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkawar...)
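
To make the idea concrete, here's a rough sketch of one simple form of RoPE scaling, linear position interpolation: positions are divided by a scale factor so that a longer sequence maps back into the position range the model saw during training. This is only an illustration and not necessarily the variant the linked post describes. The names `rope_angles`, `apply_rope` and `scale` are made up for this example, and `base=10000.0` is assumed as the usual RoPE default.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under RoPE.

    scale > 1 applies linear position interpolation (an assumption of this
    sketch): the position is divided by `scale` before computing angles.
    """
    pos = position / scale
    # One angle per pair of dimensions, with geometrically decreasing frequency.
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs of `vec` by the angles for `position`."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

With `scale=4.0`, position 8192 gets the same angles as position 2048 does unscaled, so a model trained on 2K positions never sees an angle outside its training range. The catch is that neighbouring tokens end up closer together in angle space, which is why this tends to work better with some fine-tuning.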

But even at 100K, you do eventually run out of context. You would with 1M tokens too. 100K tokens is the new 64K of RAM: you're going to end up wanting more.

So techniques like RAG that others have mentioned end up being necessary at some point, at least with models that look like the ones we have today.
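
For anyone who hasn't seen it, the retrieval half of RAG boils down to something like the toy sketch below: score chunks against the query, then pack the best ones into whatever context budget you have. Everything here is an assumption for illustration. `retrieve` is a made-up name, the scoring is plain bag-of-words cosine similarity rather than learned embeddings, and "tokens" are just whitespace-separated words.

```python
import math
from collections import Counter

def _vec(text):
    # Bag-of-words term counts; a real system would use an embedding model.
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, token_budget):
    """Return the most relevant chunks that fit within token_budget."""
    q = _vec(query)
    scored = sorted(((_cosine(q, _vec(c)), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    picked, used = [], 0
    for score, chunk in scored:
        cost = len(chunk.split())  # crude token count: whitespace words
        if score <= 0 or used + cost > token_budget:
            continue  # skip irrelevant chunks and ones that don't fit
        picked.append(chunk)
        used += cost
    return picked
```

The point is that the budget is a hard limit however large the window gets, so something has to decide what goes in it.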