by razodactyl 848 days ago
I have a theory that the results are actually a side effect of having the information in a different area of the context block.

Models can be sensitive to the location of a needle in the haystack of their input block.

It's why some models are great at single-turn conversation but can't hold a conversation past that without multi-turn training.

You can even corrupt the outputs by pushing past the number of turns the model was trained on, or by showing it data in a form it hasn't really seen before.
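
One rough way to probe the position sensitivity described above is to place the same needle at different depths of otherwise identical filler text and compare the answers. The sketch below only builds the prompts. The helper name `build_haystack_prompt`, the filler sentences, and the needle text are made up for illustration, and sending the prompts to a model is left out.

```python
def build_haystack_prompt(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth of the filler (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Convert the relative depth into a sentence index.
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)


# Hypothetical filler and needle, purely for illustration.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 4217."

# Same content, three needle positions: start, middle, end.
prompts = {d: build_haystack_prompt(filler, needle, d) for d in (0.0, 0.5, 1.0)}
```

If the answers differ across these prompts, the difference comes from where the needle sits, since everything else is held constant.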

1 comment

> Models can be sensitive to the location of a needle in the haystack of their input block.

But only if we use some sort of attention optimization. For the quadratic attention algo it shouldn’t matter where the needle is, right?
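
A small sketch of the intuition behind that question, under strong assumptions: this is bare softmax attention for a single query, with no positional encoding, no causal mask, and no trained weights. In that stripped-down setting, shuffling the key/value pairs leaves the output unchanged. Real models add positional information and are shaped by training, which this sketch leaves out, so it does not settle whether position matters in practice.

```python
import math
import random


def attention(query, keys, values):
    """Plain softmax attention for one query vector (no positions, no mask)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


rng = random.Random(0)
n, d = 8, 4  # arbitrary toy sizes
query = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Shuffle the key/value pairs together, i.e. move every "needle" around.
order = list(range(n))
rng.shuffle(order)
shuffled_keys = [keys[i] for i in order]
shuffled_values = [values[i] for i in order]

original = attention(query, keys, values)
shuffled = attention(query, shuffled_keys, shuffled_values)

# Equal up to floating-point rounding from the changed summation order.
matches = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
              for a, b in zip(original, shuffled))
```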