I understand that, which is why I said "Ignore LLM issues with character counting for this example". It was a quick example; please see my other comment for a better one.
I see, active listening + relating it to my knowledge on my end, lmk if I compressed too much:
you're curious if there's noticeably worse performance if the question comes at the end of the content rather than before it
No, there's a good paper on this somewhere using Claude 100K. tl;dr: it's sort of bow-shaped. The beginning and end had equally high rates, but the middle would suffer
No, what I am specifically asking about is these sliding-window attention techniques. As far as I understand it, Claude 100K actually uses a 100k context window, and not a sliding window.
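To make the distinction concrete, here is a minimal toy sketch of which positions the last token can attend to directly under full causal attention versus a sliding window. This is not any particular model's implementation; `causal_mask`, `sliding_window_mask`, and the sizes are made up for illustration.

```python
def causal_mask(n):
    """Full causal attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Sliding-window attention: token i attends only to the most recent
    `window` tokens, itself included (hypothetical helper for illustration)."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


n, window = 6, 3  # toy sizes, chosen arbitrarily
full = causal_mask(n)
local = sliding_window_mask(n, window)

# Count how many positions the final token (e.g. a question at the end)
# can see directly in a single attention layer.
print(sum(full[-1]), sum(local[-1]))  # prints "6 3"
```

In the toy sliding-window case, a question placed at the end only sees the last `window` tokens directly within one layer; anything earlier would have to reach it indirectly through stacked layers, which is why the placement question could plausibly matter more there than with a full context window.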