| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by weinzierl 1035 days ago
	They are more resource (time and memory) intensive in training and inference, that is their disadvantage. For a fair comparison you would have to compare a 8k to a 32k pre-trained model with otherwise similar hyper-parameters. OP is about a 32k sugar-coated Llama 2, so I would expect it be similar in performance to other Llama 2 derivatives.

1 comments

semi 1035 days ago

Is the increased resource usage inherent to the model or does it only happen when using the extra context? Like if your workflow currently fits in a 2k model would an 8k model be objectively worse and only worth using once you've filled the context up of a smaller model? Or would it be worth always using an 8k context model and just knowing it will get slower and more resource hungry as your context grows?

Sorry for the random question, I've just been curious about this for a while and unable to find out and you seem knowledgeable about these extended models.

weinzierl 1035 days ago

The big cost is training with large chunks and you pay that regardless how large the chunks that you feed the model later are. At inference time you only pay what you use.

I think the context length is not a parameter of the model in the sense that it is set to a particular value but it is just the size of the chunks you feed in during training. The model will only ever be able to learn relationships within that length. In that sense it is an implicit property of the model.

At inference time you can well query the model with chunks larger than what it was trained with and it will answer without a blink. You just cannot expect the answers to contain meaningful information beyond the length the model was once trained with.