| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sp332 1062 days ago
	With transformer models, it’s common to put instructions and system messages at the beginning of the input. But with this decay, the beginning of the input would always have the sparsest attention, right? Maybe the instructions should be moved to the end. But then again if it’s recurrent, you might want to prime it with a description of the task.

1 comments

ttul 1061 days ago

Hypothetically, fine tuned transformer networks would learn to attend more to the beginning and end of the sequence since that is where the instructions typically are. And lo and behold a recent paper demonstrated this to be true.

link