by npsomaratna 1056 days ago
My understanding is that with NTK-aware RoPE scaling, the model does pay uniform attention across the context. With older scaling methods, not as much.
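For anyone who wants to see the difference concretely, here is a rough sketch of how the two approaches change the RoPE frequencies. It assumes the commonly cited NTK-aware formula (multiply the base by alpha^(dim/(dim-2))) and takes linear position interpolation as the "older method"; the function names, `dim`, and `alpha` are made up for illustration and do not come from any particular library.

```python
def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def ntk_scaled_inv_freq(dim, alpha, base=10000.0):
    """NTK-aware scaling (assumed formula): enlarge the base, leave positions alone."""
    scaled_base = base * alpha ** (dim / (dim - 2))
    return rope_inv_freq(dim, scaled_base)


def interpolated_inv_freq(dim, alpha, base=10000.0):
    """Linear position interpolation: same effect as dividing every frequency by alpha."""
    return [f / alpha for f in rope_inv_freq(dim, base)]


if __name__ == "__main__":
    dim, alpha = 128, 4.0  # hypothetical head dimension and context extension factor
    orig = rope_inv_freq(dim)
    ntk = ntk_scaled_inv_freq(dim, alpha)
    lin = interpolated_inv_freq(dim, alpha)
    # Compare the highest- and lowest-frequency components under each scheme.
    print("highest freq:", orig[0], ntk[0], lin[0])
    print("lowest freq: ", orig[-1], ntk[-1], lin[-1])
```

With this formula the highest frequency is left untouched and only the lowest one is stretched by the full factor alpha, whereas interpolation compresses every frequency equally. Whether that is the reason attention stays more uniform is my reading, not something the sketch proves.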