| HN Mirror

>For example, in jointly learning to align and translate, the attention is certainly not invariant to number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over the fixed number of context vectors

That's not true.

You compute an attention weight across however many context steps you have by computing an interaction between some current decoder hidden state and every encoder hidden state, and normalizing over all of them via a softmax. There is no constraint whatsoever on a fixed context length or a fixed number of context vectors. See section 3.1 in the paper.

I will be happy to discuss and clarify over email.