by senseiV 792 days ago
I've noticed the same on extremely small models as well; magnitude is a positional encoding or a couple of tokens, so it's easy to grok?