| HN Mirror

	by causal 660 days ago
	It's a different attention mechanism with a different map setup, so it's fundamentally a different type of model.

om8 660 days ago

Looks like it is a drop-in replacement for attention, but yes, models will need to be retrained to use it.

aDyslecticCrow 660 days ago

It may not need to be entirely retrained. The value spans and input are the same, and no extra weights are needed. You may be able to fine-tune an existing model with this attention mechanism and get some of the benefits.

But overall... it's mainly a training change, so training is needed to make a difference.
