by phowon 2732 days ago

>For example, in jointly learning to align and translate, the attention is certainly not invariant to number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over the fixed number of context vectors

That's not true.

You compute attention weights over however many context steps you have: score the current decoder hidden state against every encoder hidden state, then normalize those scores with a softmax. Nothing constrains the context length or the number of context vectors to be fixed. See section 3.1 of the paper.
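To make that concrete, here is a rough pure-Python sketch of the idea, not the paper's actual code. The additive scoring form (`v · tanh(W s + U h_j)`) is my reading of the paper's alignment model, and the names `attention_weights`, `context_vector`, `W`, `U`, `v` and all the dimensions are made up for illustration. The point is only that the learned parameters have shapes independent of the number of encoder states `T`, so the same parameters work for any `T`.

```python
import math
import random


def attention_weights(s, H, W, U, v):
    """Softmax-normalized scores of decoder state s against each encoder state in H.

    s: decoder hidden state, length d_dec
    H: list of T encoder hidden states, each of length d_enc (T can be anything >= 1)
    W: d_att x d_dec matrix, U: d_att x d_enc matrix, v: vector of length d_att
    None of the parameter shapes depend on T.
    """
    # Project the decoder state once; it is shared across all encoder steps.
    Ws = [sum(w * x for w, x in zip(row, s)) for row in W]
    scores = []
    for h in H:
        Uh = [sum(u * x for u, x in zip(row, h)) for row in U]
        scores.append(sum(vk * math.tanh(a + b) for vk, a, b in zip(v, Ws, Uh)))
    # Softmax over however many scores there are (max subtracted for stability).
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, H):
    """Weighted sum of encoder states; its length is d_enc regardless of T."""
    return [sum(w * h[k] for w, h in zip(weights, H)) for k in range(len(H[0]))]


# Hypothetical sizes and random parameters, only to exercise the functions.
rng = random.Random(0)
d_dec, d_enc, d_att = 4, 6, 5
W = [[rng.gauss(0, 1) for _ in range(d_dec)] for _ in range(d_att)]
U = [[rng.gauss(0, 1) for _ in range(d_enc)] for _ in range(d_att)]
v = [rng.gauss(0, 1) for _ in range(d_att)]
s = [rng.gauss(0, 1) for _ in range(d_dec)]

# Same parameters, three different source lengths.
for T in (3, 7, 50):
    H = [[rng.gauss(0, 1) for _ in range(d_enc)] for _ in range(T)]
    weights = attention_weights(s, H, W, U, v)
    context = context_vector(weights, H)
    print(T, len(weights), len(context))
```

The loop at the end reuses one set of parameters for three different source lengths; each time you get one weight per encoder state and a context vector of the same size.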

I will be happy to discuss and clarify over email.

1 comment

mrcoder111 2728 days ago

Sounds good. Can you send me an email? I put my address in my profile's about section.
