
The big cost is training with large chunks, and you pay that regardless of how large the chunks you feed the model later are. At inference time you only pay for what you use.
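To put rough numbers on "you only pay for what you use", here is a minimal sketch. It assumes a transformer-style model with causal self-attention, where each token attends to itself and every earlier token; the comment above does not name an architecture, so that is my assumption. The function name `attention_pairs` and the lengths 2048 and 128 are made up for illustration, and the count ignores every other cost (feed-forward layers, memory, batching).

```python
def attention_pairs(seq_len: int) -> int:
    """Count (query, key) pairs under causal self-attention.

    Assumption: token i attends to tokens 0..i, so a sequence of n tokens
    needs 1 + 2 + ... + n = n * (n + 1) / 2 pairs.
    """
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    return seq_len * (seq_len + 1) // 2


train_len = 2048   # hypothetical chunk size used during training
query_len = 128    # hypothetical, much shorter prompt at inference time

train_cost = attention_pairs(train_len)
query_cost = attention_pairs(query_len)

print(train_cost)                 # pairs per training chunk
print(query_cost)                 # pairs for the short query
print(train_cost // query_cost)   # how many times cheaper the short query is
```

Under these assumptions, the short query costs a small fraction of one training chunk, even though the model was trained on the long ones.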

I think the context length is not a parameter of the model in the sense of being set to a particular value; it is just the size of the chunks you feed in during training. The model can only ever learn relationships within that length. In that sense it is an implicit property of the model.
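A small sketch of what "implicit" means here, assuming the training data is simply cut into consecutive, non-overlapping chunks (real pipelines may use overlapping or packed windows). The helper names `make_chunks` and `max_seen_distance` are hypothetical, and the integers stand in for token ids.

```python
from typing import List, Sequence


def make_chunks(tokens: Sequence[int], chunk_len: int) -> List[List[int]]:
    """Split a token stream into consecutive chunks of at most chunk_len tokens."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [list(tokens[i:i + chunk_len]) for i in range(0, len(tokens), chunk_len)]


def max_seen_distance(chunks: List[List[int]]) -> int:
    """Largest gap, in tokens, between two positions that share a chunk."""
    return max((len(chunk) - 1 for chunk in chunks), default=0)


tokens = list(range(10))          # stand-in for a tokenized corpus
chunks = make_chunks(tokens, 4)   # chunk_len=4 plays the role of the context length

print(chunks)                     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(max_seen_distance(chunks))  # 3
```

Nothing in the data ever pairs two tokens more than `chunk_len - 1` positions apart, so the model gets no training signal for longer-range relationships, whatever its architecture would allow.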

At inference time you can very well query the model with chunks larger than what it was trained on, and it will answer without blinking. You just cannot expect the answers to make meaningful use of anything beyond the length the model was trained with.