| HN Mirror



	by amatsukawa 3389 days ago
	Let me make sure I understand: rather than using gates, each step explicitly takes a weighted average of all past steps using attention. What is the training speed of this network? Computation seems to scale as N^2 rather than N.

1 comment

jostmey 3389 days ago

EDIT: The attention mechanism is nothing more than a weighted average. The weighted average is computed as a running average by saving the numerator and denominator terms at each step.
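
To make the running-average point concrete, here is a rough sketch, not the paper's exact formulation. It assumes scalar values, weights of the form exp(score), and scores that depend only on the step they belong to; the function names are made up for illustration. The first function carries the numerator and denominator forward, so each step is O(1); the second recomputes the average over all past steps, which is the N^2 version.

```python
import math


def running_weighted_average(scores, values):
    """Weighted average of values[0..t] at every step t, in O(N) total.

    Assumption: the weight of step i is exp(scores[i]) and does not
    depend on which later step is reading it.
    """
    numerator = 0.0    # running sum of weight * value
    denominator = 0.0  # running sum of weights
    outputs = []
    for score, value in zip(scores, values):
        weight = math.exp(score)
        numerator += weight * value
        denominator += weight
        outputs.append(numerator / denominator)
    return outputs


def naive_weighted_average(scores, values):
    """Same result, but re-sums all past steps at each step: O(N^2) total."""
    outputs = []
    for t in range(len(values)):
        weights = [math.exp(s) for s in scores[: t + 1]]
        total = sum(weights)
        weighted = sum(w * v for w, v in zip(weights, values[: t + 1]))
        outputs.append(weighted / total)
    return outputs
```

Both functions should agree up to floating-point error. A real implementation would probably also track a running maximum of the scores to keep exp() from overflowing, which this sketch skips.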

I hope the description in the paper is clear. You can follow the arXiv link in the README; skip straight to Section 2 for the details of the model.
