Hacker News
by mrcoder111 2689 days ago
How do you handle variable-length input without something like an RNN? Even transformers use RNN structures, right?

I suppose convolutions could technically handle variable-length inputs (just slide the window of weights over inputs of different lengths), but I don't think TensorFlow or PyTorch supports this.

2 comments

>Even transformers use RNN structures right.

Nope.

>How do you handle variable length input without something like an RNN?

Any form of pooling, really: max, average, sum. The tricky part is how to do the pooling while still taking advantage of the sequential structure of the input. Transformer-based models have shown that you can get away with providing very little order information and still go very far.

Samcodes said it above. How do transformers build a shared representation of two input sentences with different lengths? If you convolve them with the same filter, you get two differently sized convolution outputs, so the embedding dimensions don't align.
Like I said - pooling.

You can take the mean over 3 elements or 10 elements just the same. Pooling is lossy, but it seems that if you have the right architecture the model can still learn what it needs to.
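
To make the "3 elements or 10 elements" point concrete, here is a minimal sketch in plain Python. The helper name `pool` and the toy vectors are made up for illustration; the only assumption is that every time step has the same embedding width.

```python
def pool(vectors, mode="mean"):
    """Collapse a list of equal-width vectors (any length) into one vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown mode: {mode}")


short = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # 3 time steps
long = [[float(i), float(2 * i)] for i in range(10)]  # 10 time steps

short_vec = pool(short)  # width 2
long_vec = pool(long)    # also width 2, despite 10 inputs
```

Both results have the embedding width, so anything downstream sees a fixed-size input regardless of sequence length.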

It's worth noting that the attention mechanism (at least in RNNs) has always been invariant to input length. It's a weighted sum with weights computed per element, so there's no length constraint at all.

Can you share some paper names or links to architectures that demonstrate the length invariant convolution and attention?
I'm not sure if you're understanding me correctly.

Attention is generally length invariant. You apply some transformation to the hidden representations (and/or inputs) at each time step, and then you normalize over all the transformed values to get weights that sum to one. No part of this is constrained by length.
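
A rough sketch of that recipe in plain Python. The scoring function here (a dot product with a query vector) is an assumption chosen for brevity; real models typically use a learned transformation, and `attend` is a hypothetical name.

```python
import math


def attend(query, states):
    """Return (weights, context): softmax weights per step and their weighted sum."""
    # Per-step score: dot product with the query (an assumed scoring function).
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    # Softmax over however many steps there are; subtract the max for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum: output width depends on the state width, not the step count.
    width = len(states[0])
    context = [sum(w * state[d] for w, state in zip(weights, states))
               for d in range(width)]
    return weights, context


query = [1.0, 0.0]
w3, c3 = attend(query, [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])   # 3 steps
w10, c10 = attend(query, [[0.1 * i, 1.0] for i in range(10)])  # 10 steps
```

The same function handles 3 or 10 steps: the weights always sum to one and the context vector always has the width of a single state.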

For CNNs, any network that has pooling has the potential to be length/dimension invariant. Whether it actually is depends on both the architectural design and implementation details (e.g., some implementations define the pooling operation over a fixed window, say 9x9, when you could define the same pooling operation over a variable-dimension window).

Length/dimension invariance isn't a special or novel property. In the case of attention it's built in. In the case of CNNs, the convolutions are not length invariant, but depending on the architecture, the pooling operations are (or can be modified to be).
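
One way to picture pooling over a variable-dimension window is to fix the number of output bins and let the window size follow the input length. The sketch below is plain Python with a hypothetical helper name; PyTorch's `nn.AdaptiveMaxPool1d` is built on the same idea, though this is not its implementation.

```python
def adaptive_max_pool(values, out_size):
    """Max-pool a sequence of any length down to exactly `out_size` values."""
    n = len(values)
    if n == 0 or out_size <= 0:
        raise ValueError("need a non-empty input and a positive out_size")
    pooled = []
    for i in range(out_size):
        start = (i * n) // out_size                     # floor(i * n / out_size)
        end = ((i + 1) * n + out_size - 1) // out_size  # ceil((i + 1) * n / out_size)
        pooled.append(max(values[start:end]))           # window grows with n
    return pooled


six = adaptive_max_pool([1, 5, 2, 8, 3, 7], 3)  # windows of 2 elements
ten = adaptive_max_pool(list(range(10)), 3)     # windows of 4 elements, overlapping
```

Both calls return three values, so the layers after the pooling step never see the original length.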

In order to get a variable-length context, you need to add some machinery to some forms of attention. For example, in "Neural Machine Translation by Jointly Learning to Align and Translate", the attention is certainly not invariant to the number of context vectors. You train the attention to take in a fixed number of context vectors and produce a distribution over that fixed number of context vectors. You cannot train on images with 5 annotations/context vectors and expect anything to transfer to a setting with 10 annotations. That's why I would be interested in a specific paper to solidify what you're saying.
The hard part is that after the convolutions you want a fully connected layer or two, and to get those dimensions right you need to know the input dimensions. But PyTorch builds the graph at runtime, so maybe you could do this...
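
One common way around the fully connected sizing problem is to pool each filter's output over the whole sequence before the dense layer, so the dense layer's input width equals the number of filters. A toy sketch in plain Python, with made-up filters and weights (in a real model these would be learned); it assumes each input is at least as long as the kernel.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation; output length is len(signal) - len(kernel) + 1."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def features(signal, kernels):
    """One number per filter: the global max over that filter's output."""
    return [max(conv1d(signal, kernel)) for kernel in kernels]


def dense(inputs, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


kernels = [[1.0, -1.0], [0.5, 0.5], [1.0, 0.0]]  # 3 filters -> 3 features
weights = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]     # 2 outputs x 3 features
bias = [0.0, 1.0]

short = [1.0, 2.0, 4.0]
long = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

out_short = dense(features(short, kernels), weights, bias)
out_long = dense(features(long, kernels), weights, bias)
```

The convolution outputs have lengths 2 and 7, but the dense layer always receives three numbers, so its weight shape never depends on the input length.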