Hacker News new | ask | show | jobs
by deliciousturkey 5 days ago
I dislike the non-specificity of "models" here. Different models have different attention architectures, and can therefore have significant differences in long-context behavior. It's true that long context is an issue can most models do drop off in quality, but I would not extrapolate behavior of old models to new ones.
1 comments

could you explains on how? what changed with attention mechanisms to allow for such a shift?