	by microtonal 859 days ago
	Nice finding, and it makes a lot of sense! It is somewhat related to classification heads that learn their own weighted combination of all transformer layer outputs. I only glanced at the paper, but they don't seem to apply a softmax to the α_i for normalization?
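
	To make concrete what I mean by softmaxing the α_i: a minimal sketch of a softmax-normalized layer mix, in plain Python. This is my own illustration of the classification-head trick, not the paper's method; `softmax`, `scalar_mix`, and the toy values are hypothetical names and data.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_outputs, alphas):
    """Weighted sum of per-layer vectors, with weights softmax(alphas)."""
    if len(layer_outputs) != len(alphas):
        raise ValueError("need exactly one raw weight per layer")
    weights = softmax(alphas)  # non-negative, sums to 1
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_outputs))
        for d in range(dim)
    ]


# Toy example: 3 layers, hidden size 2 (made-up values).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = [0.0, 0.0, 0.0]  # equal raw scores give uniform weights
mixed = scalar_mix(layers, alphas)
```

	With the softmax the weights are non-negative and sum to 1, so the mix is a convex combination of the layer outputs. Without it, the raw α_i can be negative or grow without bound, which is the difference I was asking about.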