Hacker News new | ask | show | jobs
by sp332 1062 days ago
With transformer models, it’s common to put instructions and system messages at the beginning of the input. But with this decay, the beginning of the input would always have the sparsest attention, right? Maybe the instructions should be moved to the end. But then again if it’s recurrent, you might want to prime it with a description of the task.
1 comments

Hypothetically, fine tuned transformer networks would learn to attend more to the beginning and end of the sequence since that is where the instructions typically are. And lo and behold a recent paper demonstrated this to be true.