by imtringued 332 days ago
>If the main thesis is "training on larger resolution results in better performance on high resolution images" then this seems to be a conclusion we already knew from a pure mathematical understanding of entropy, and is something many researchers have been discussing for decades.

I think you missed the part where the word performance is doing double duty here. Performance as in accuracy of the result and performance as in the time it takes to achieve said result.

The expectation is that training on a larger resolution will worsen performance in the second sense. You also mentioned that downsampling images destroys information, so FastVLM should also perform worse in the first sense, since it is clearly running its transformer layers on downsampled images: the patch embedding halves the image resolution with each layer.
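To put rough numbers on the halving argument, here is a minimal sketch. The five halvings and the input sizes are illustrative assumptions, not FastVLM's actual configuration, and both function names are made up for this example. It only shows how quickly the token count (and the quadratic self-attention cost) shrinks when each stage halves the resolution.

```python
def tokens_after_halvings(side: int, n_halvings: int) -> int:
    """Token count of a square feature map after n stride-2 downsampling steps.

    Hypothetical helper: assumes every stage exactly halves height and width.
    """
    for _ in range(n_halvings):
        side = side // 2
    return side * side


def attention_pairs(tokens: int) -> int:
    """Self-attention compares every token with every other token (quadratic)."""
    return tokens * tokens


if __name__ == "__main__":
    # Assumed input sizes and stage count, chosen only for illustration.
    for side in (256, 512, 1024):
        t = tokens_after_halvings(side, n_halvings=5)
        print(f"{side}x{side} input: {t} tokens, {attention_pairs(t)} attention pairs")
```

Under these assumptions, doubling the input side quadruples the tokens the transformer layers see, so the second kind of performance still depends on how aggressively the earlier stages downsample.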

To be fair, the presented network architecture does not really look like anything special. Three CNN layers with two transformer layers is just good product engineering. The real insight to be had here is that writing your own custom downsampling algorithm is a waste of time. You should make the downsampling learnable and part of the model.

1 comments

Sorry, I should clarify

  > The expectation is that training on a larger resolution will worsen performance in the second sense.

  > downsampling images will destroy information, hence FastVLM should also perform worse in the first sense
I do not think these are in contention. By training on larger images, the embedding subnetwork can better learn to embed the requisite information. It need not hurt performance in the sense of inference speed. It would worsen inference speed were everything else held equal, or if we just naively scaled. But it can actually be better if the learned algorithm is more efficient at extracting information, and there is the added advantage of having access to more information: the larger-resolution photo simply contains more. On the other hand, if you train a model for a different downsampling task, what it learned may not transfer well to the new downsampling task, which makes finetuning tricky and insufficient for a hard conclusion.

Note that their model is smaller. That actually gives us good analysis opportunities, since it is consistent with what I'm implying: a more efficient embedding.

  > Three CNN layers with two transformer layers is just good product engineering. The real insight to be had here is that writing your own custom downsampling algorithm is a waste of time. You should make the downsampling learnable and part of the model.
Actually, the reason I linked [3] is that this reminded me of that paper. They used an overlapping (convolutional) patch-and-embed method in the ViT model as opposed to the standard hard partitioning. In effect, that is the same conclusion: learn your downsampler (embedder).
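For anyone unfamiliar with the distinction, a small sketch of hard partitioning versus overlapping patch embedding, using the standard convolution output-size formula. The kernel, stride, and padding values are assumptions picked for illustration (16/16 is the usual ViT patching; 7/4/3 is just one plausible overlapping setting, not necessarily what the paper in [3] used).

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Number of output positions along one axis for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def window_starts(size: int, kernel: int, stride: int, padding: int = 0) -> list[int]:
    """Start index (in input coordinates) of each patch window along one axis."""
    n = conv_output_size(size, kernel, stride, padding)
    return [i * stride - padding for i in range(n)]


def windows_overlap(kernel: int, stride: int) -> bool:
    """Neighbouring windows share pixels exactly when the kernel exceeds the stride."""
    return kernel > stride


if __name__ == "__main__":
    # Hard partitioning: kernel == stride, so patches tile the image with no overlap.
    print(conv_output_size(224, kernel=16, stride=16), windows_overlap(16, 16))
    # Overlapping embedding: kernel > stride, so each pixel feeds several patches.
    print(conv_output_size(224, kernel=7, stride=4, padding=3), windows_overlap(7, 4))
```

In both cases the embedding is a learned convolution; the overlapping variant just lets neighbouring patches share context instead of cutting the image on a fixed grid.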

I think we're pretty much in agreement. I just really want to see more ablations