by foota
998 days ago
I could be wrong, but I'm not sure this is about what people seem to think it is, e.g., letting LLMs reference content past the trained length. I think it may just be about the performance of the model on longer texts (on the things still within the context window?). It sounds like they're arguing that the model essentially learns to stick some baggage in the attention to the initial tokens of the text, and breaks when those tokens aren't within the window anymore, for reasons I'm not sure I understand (after all, isn't text in the middle just as good as text at the start for non-instruction inputs?)