by Havoc 1035 days ago
Does anyone know if larger context lengths are inherently worse at other tasks?

i.e. all other things being equal, is an 8k model better at math than a 32k model?

2 comments

There are a couple of models on Hugging Face that use NTK/linear RoPE scaling that you can play with. Vicuna and WizardLM both have a 16K context model. The biggest issue is that at really high context lengths, the model sometimes produces weird repetitions. To be fair, I have only tried the quantized 13B models (the largest I can run locally), so I'm not sure whether the repetitions are an artifact of the RoPE scaling, the quantization, or both.
They are more resource-intensive (time and memory) in training and inference; that is their disadvantage. For a fair comparison you would have to compare an 8k to a 32k pre-trained model with otherwise similar hyperparameters.
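For anyone wondering what "NTK/linear RoPE" means in practice, here is a minimal sketch of the two scaling tricks as they are commonly described. It is not the code Vicuna or WizardLM actually ship; the function names, the head dimension of 128, and the NTK base formula `base * scale ** (d / (d - 2))` are assumptions taken from the usual community write-ups.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    # Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def linear_scaled_angles(pos, head_dim, scale, base=10000.0):
    # Linear scaling (position interpolation): squeeze the position index
    # by `scale`, so every frequency band is slowed down equally.
    return [(pos / scale) * f for f in rope_inv_freq(head_dim, base)]

def ntk_scaled_angles(pos, head_dim, scale, base=10000.0):
    # NTK-aware scaling (assumed formula): keep positions as they are and
    # enlarge the base, which leaves the fastest band untouched and slows
    # the slowest band by exactly `scale`.
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return [pos * f for f in rope_inv_freq(head_dim, new_base)]

if __name__ == "__main__":
    d, scale, pos = 128, 4.0, 8192  # hypothetical: 4x context extension
    plain = [pos * f for f in rope_inv_freq(d)]
    lin = linear_scaled_angles(pos, d, scale)
    ntk = ntk_scaled_angles(pos, d, scale)
    print("fastest band:", plain[0], lin[0], ntk[0])
    print("slowest band:", plain[-1], lin[-1], ntk[-1])
```

The difference shows up in the fastest band: linear scaling compresses it along with everything else, while the NTK variant leaves it alone and only stretches the slow bands.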

OP is about a 32k sugar-coated Llama 2, so I would expect it to be similar in performance to other Llama 2 derivatives.

Is the increased resource usage inherent to the model, or does it only happen when using the extra context? Like, if your workflow currently fits in a 2k model, would an 8k model be objectively worse and only worth using once you've filled up the context of a smaller model? Or would it be worth always using an 8k context model and just knowing it will get slower and more resource hungry as your context grows?

Sorry for the random question, I've just been curious about this for a while and unable to find out and you seem knowledgeable about these extended models.

The big cost is training with large chunks, and you pay that regardless of how large the chunks are that you feed the model later. At inference time you only pay for what you use.
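To put rough numbers on "you only pay for what you use", here is a back-of-the-envelope sketch of KV cache size as a function of the tokens actually in the context. The defaults (40 layers, 40 KV heads, head dimension 128, fp16 values) are assumptions meant to resemble a 13B Llama-style model; the function name is made up for illustration, and real runtimes add their own overhead.

```python
def kv_cache_bytes(tokens_in_use, n_layers=40, n_kv_heads=40,
                   head_dim=128, bytes_per_value=2):
    # One key and one value vector per token, per layer, per KV head.
    # The model's maximum context length does not appear anywhere here;
    # only the number of tokens currently in the context does.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * tokens_in_use

if __name__ == "__main__":
    gib = 1024 ** 3
    for tokens in (2048, 8192, 32768):
        print(tokens, "tokens:", kv_cache_bytes(tokens) / gib, "GiB")
```

Under these assumptions a 2k prompt needs about 1.6 GiB of cache whether the model was trained for 8k or 32k, and filling 32k needs 16 times that.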

I think the context length is not a parameter of the model in the sense that it is set to a particular value; it is just the size of the chunks you feed in during training. The model will only ever be able to learn relationships within that length. In that sense it is an implicit property of the model.

At inference time you can perfectly well query the model with chunks larger than what it was trained on, and it will answer without blinking. You just cannot expect the answers to contain meaningful information from beyond the length the model was trained with.
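As a toy illustration of context length being the training chunk size rather than a stored setting, the sketch below cuts a token stream into fixed-size blocks. The helper names and the choice to drop the ragged tail are assumptions for illustration; real training pipelines differ in the details.

```python
def make_training_chunks(token_ids, block_size):
    # Cut the stream into consecutive blocks of block_size tokens and
    # drop the incomplete tail (a simplifying assumption).
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size]
            for i in range(n_full)]

def max_span_seen(chunks):
    # Largest distance between two tokens that ever share a chunk, i.e.
    # the longest relationship the model could learn from this data.
    return max(len(chunk) for chunk in chunks) - 1

if __name__ == "__main__":
    stream = list(range(10))  # stand-in for real token ids
    chunks = make_training_chunks(stream, block_size=4)
    print(chunks)
    print(max_span_seen(chunks))
```

Nothing in the chunks records a limit; the limit is simply that no two tokens further apart than the block size ever appear together during training.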