by mrcoder111
2689 days ago
Samcodes said it above. How do transformers build a shared representation of two input sentences with different lengths? If you convolve them with the same filter, you get two differently sized convolution outputs - the embedding dimensions don't align.

You can take the mean over 3 elements or 10 elements just the same; either way you end up with a single vector of the same size. Pooling is lossy, but it seems that with the right architecture the model can still learn what it needs to.
It's worth noting that the attention mechanism (at least in RNNs) has always been invariant to input length. It's a weighted sum with weights computed per element, so there's no length constraint at all.
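To make that concrete, here is a rough sketch in plain Python (no framework) of both ideas: mean pooling and an attention-style weighted sum. The names `mean_pool`, `attention_pool`, and `query`, and the dot-product scoring, are my own assumptions for illustration, not how any particular model does it. The point is only that the output width depends on the element width, not on how many elements go in.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-width vectors into one vector of that width."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def attention_pool(query, vectors):
    """Softmax-weighted sum of vectors; each weight comes from a
    dot-product score between the (hypothetical) query and one element."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in vectors]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

short = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # 3 elements
long = [[float(i % 2), float(i % 3)] for i in range(10)]   # 10 elements
query = [0.5, -0.5]

for seq in (short, long):
    # Prints the input length, then the width of each pooled vector.
    print(len(seq), len(mean_pool(seq)), len(attention_pool(query, seq)))
```

Both sequences come out as 2-wide vectors, so they can be compared or fed to the same downstream layer. With an all-zero query the attention weights are uniform and `attention_pool` reduces to `mean_pool`, which is one way to see that mean pooling is just the least selective version of the same weighted sum.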