by razodactyl
848 days ago
I have a theory that the results are actually a side effect of having the information in a different area of the context block. Models can be sensitive to where a needle sits in the haystack of their input. It's why there are models that are great at single-turn conversation but can't hold a conversation past that without multi-turn training. You can even corrupt the outputs by pushing past the number of turns the model was trained on, or by showing it data in a form it hasn't really seen before.
|
But only if we use some sort of attention optimization. For the quadratic attention algo it shouldn’t matter where the needle is, right?
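
A minimal sketch of that intuition, under loudly stated assumptions: a single query, one head, plain softmax attention with no positional encoding and no causal mask. The names `attend` and `needle_outputs` are made up for illustration. Under those assumptions the attention output is the same wherever the needle's key/value pair sits, up to floating-point rounding. Real models add positional information and are trained on particular layouts, so this does not by itself rule out position sensitivity in a trained model.

```python
import math
import random


def attend(query, keys, values):
    """Single-query softmax attention: sum_i softmax(q . k_i / sqrt(d)) * v_i."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim_v)]


def needle_outputs(n_tokens=16, dim=4, seed=0):
    """Move one 'needle' key/value pair through every position of a random haystack."""
    rng = random.Random(seed)
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    query = [rng.gauss(0, 1) for _ in range(dim)]

    # Treat the last pair as the needle; the rest is the haystack.
    needle_key, needle_value = keys.pop(), values.pop()

    outputs = []
    for pos in range(n_tokens):
        ks = keys[:pos] + [needle_key] + keys[pos:]
        vs = values[:pos] + [needle_value] + values[pos:]
        outputs.append(attend(query, ks, vs))
    return outputs


if __name__ == "__main__":
    outs = needle_outputs()
    spread = max(abs(a - b) for out in outs for a, b in zip(out, outs[0]))
    print(f"max difference across needle positions: {spread:.2e}")
```

The spread printed at the end should be at rounding-error scale, since reordering the keys and values together only changes the order of the sums. Adding a positional term to the scores, or a causal mask, would break that invariance.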