	by mrcoder111 2736 days ago
	Samcodes said it above. How do transformers build a shared representation of two input sentences with different lengths? If you convolve them with the same filter, you get two different sized convolution outputs - the embedding dimensions don't align.


phowon 2736 days ago

Like I said - pooling.

You can take the mean over 3 elements or 10 elements just the same. Pooling is lossy, but it seems that if you have the right architecture the model can still learn what it needs to.
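To make the "mean over 3 elements or 10 elements" point concrete, here is a minimal sketch in plain Python. The function name `mean_pool` and the toy inputs are made up for illustration; the only assumption is that every element has the same width.

```python
def mean_pool(vectors):
    """Average a list of equal-width vectors into one vector of that width."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

# Two toy "sentences" of different lengths, each element a 2-dim vector.
short = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]            # 3 elements
long = [[float(i), float(i) * 2] for i in range(10)]    # 10 elements

pooled_short = mean_pool(short)
pooled_long = mean_pool(long)

# Both results have width 2, regardless of how many elements went in.
print(len(pooled_short), len(pooled_long))
```

The output width depends only on the element width, which is why the pooled representations line up even though the inputs do not.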

It's worth noting that the attention mechanism (at least in RNNs) has always been invariant to input length. It's a weighted sum with weights computed per element, so there's no length constraint at all.


mrcoder111 2735 days ago

Can you share some paper names or links to architectures that demonstrate the length invariant convolution and attention?


phowon 2735 days ago

I'm not sure if you're understanding me correctly.

Attention is generally length invariant. You apply some transformation to the hidden representations (and/or the inputs) at each time step, and then you normalize over all the transformed values to get weights that sum to one. No part of this is constrained by length.
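A rough sketch of that recipe, in plain Python: score every time step with the same parameters, softmax over however many scores there are, then take the weighted sum. The scoring vector `w` and the dot-product scoring are assumptions chosen for brevity, not any particular paper's formulation.

```python
import math

def softmax(scores):
    """Normalize a list of scores into weights that sum to one."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, w):
    """Score each time step with the shared vector w, then take a weighted sum."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden_states]
    weights = softmax(scores)            # one weight per time step
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context

w = [0.5, -0.25]                                          # hypothetical parameters
seq3 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # 3 time steps
seq7 = [[0.1 * t, 1.0 - 0.1 * t] for t in range(7)]       # 7 time steps

weights3, context3 = attention_pool(seq3, w)
weights7, context7 = attention_pool(seq7, w)
```

The parameter `w` has a fixed size tied to the hidden dimension, not to the sequence length, so the same `w` works for 3 steps or 7.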

For CNNs, any network that has pooling has the potential to be length/dimension invariant. Whether it actually is depends on both the architectural design and implementation details. For example, some implementations define the pooling operation over a fixed window, say 9x9, but you could define the same pooling operation over a variable-dimension window.
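As an illustration, here is a toy 1-D version in plain Python: the convolution output length varies with the input, but pooling over the whole feature map (a window as wide as the map itself) gives one value per filter. The helper names and filter values are hypothetical, and the sketch assumes each input is at least as long as the filter.

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation); output length is len(seq) - len(kernel) + 1."""
    k = len(kernel)
    return [sum(kernel[j] * seq[i + j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def global_max_pool(feature_map):
    """Pool over the entire feature map, whatever its length."""
    return max(feature_map)

def encode(seq, kernels):
    """One pooled value per filter, so the output width equals len(kernels)."""
    return [global_max_pool(conv1d(seq, k)) for k in kernels]

kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]   # made-up filters of width 3

a = [1.0, 2.0, 3.0, 4.0, 5.0]                   # length 5
b = [float(i % 4) for i in range(12)]           # length 12

map_a = conv1d(a, kernels[0])   # length 3
map_b = conv1d(b, kernels[0])   # length 10
enc_a = encode(a, kernels)      # width 2
enc_b = encode(b, kernels)      # width 2
```

The feature maps `map_a` and `map_b` have different lengths, which is the mismatch raised earlier in the thread; the pooled encodings `enc_a` and `enc_b` do not.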

Length/dimension invariance isn't a special or novel property. In the case of attention it's built in. In the case of CNNs, the convolutions are not length invariant, but depending on the architecture, the pooling operations are (or can be modified to be).


mrcoder111 2734 days ago

In order to get a variable length context, you need to add some machinery to some forms of attention. For example, in jointly learning to align and translate, the attention is certainly not invariant to number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over the fixed number of context vectors. You cannot train on images with 5 annotations/context vectors and expect anything to transfer to a setting with 10 annotations. That's why I would be interested in a specific paper to solidify what you're saying.


phowon 2732 days ago

>For example, in jointly learning to align and translate, the attention is certainly not invariant to number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over the fixed number of context vectors

That's not true.

You compute an attention weight across however many context steps you have by computing an interaction between some current decoder hidden state and every encoder hidden state, and normalizing over all of them via a softmax. There is no constraint whatsoever on a fixed context length or a fixed number of context vectors. See section 3.1 in the paper.
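Here is a minimal sketch of that computation in plain Python, assuming the additive scoring form v·tanh(W s + U h_j) for the interaction between the decoder state `s` and each encoder state `h_j`. All parameter values and sizes below are made up; the point is only that `W`, `U`, and `v` are sized by the hidden dimensions and are reused unchanged for 5 or 10 encoder states.

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def additive_attention(s, H, W, U, v):
    """Attend from decoder state s over the list of encoder states H."""
    Ws = matvec(W, s)                                  # computed once per decoder step
    scores = []
    for h in H:                                        # one score per encoder state
        Uh = matvec(U, h)
        scores.append(sum(vi * math.tanh(a + b) for vi, a, b in zip(v, Ws, Uh)))
    m = max(scores)                                    # stabilize the softmax
    exps = [math.exp(e - m) for e in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]                     # weights over all encoder states
    dim = len(H[0])
    context = [sum(a * h[d] for a, h in zip(alphas, H)) for d in range(dim)]
    return alphas, context

# Hypothetical parameters: 2-dim states, 3-dim alignment layer.
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
U = [[0.2, 0.1], [-0.1, 0.3], [0.4, -0.4]]
v = [1.0, -1.0, 0.5]
s = [0.7, -0.3]                                        # current decoder hidden state

H5 = [[0.1 * j, 0.2 * j] for j in range(5)]            # 5 encoder states
H10 = [[0.1 * j, -0.1 * j] for j in range(10)]         # 10 encoder states

alphas5, context5 = additive_attention(s, H5, W, U, v)
alphas10, context10 = additive_attention(s, H10, W, U, v)
```

Nothing in the parameters records how many encoder states there were, so the same function returns a valid distribution over 5 states or over 10.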

I will be happy to discuss and clarify over email.


mrcoder111 2728 days ago

Sounds good - can you send me an email? I put mine in my about
