Sorry, I should clarify

  > The expectation is that training on a larger resolution will worsen performance in the second sense.

  > downsampling images will destroy information, hence FastVLM should also perform worse in the first sense

I do not think these are in contention. By training on larger images, the embedding subnetwork can better learn to embed the requisite information. It need not hurt performance in the sense of inference speed. It would mean worse inference speed were everything else held equal, or if we just naively scaled. But it can actually be better if the learned algorithm is more efficient at extracting information, since there's the advantage of having access to more information. The larger-resolution photo simply contains more information. On the other hand, if you train a model for a different downsampling task, that information may not transfer well to the new downsampling task, which makes finetuning tricky and insufficient for a hard conclusion.
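To put rough numbers on "naively scaled": a minimal sketch of how the token count grows with resolution when the downsampling factor is held fixed, versus when the embedder downsamples more aggressively. The resolutions and factors here (224, 1024, 16, 64) are illustrative assumptions on my part, not figures taken from the paper, and `num_tokens` is a made-up helper.

```python
def num_tokens(resolution: int, downsample: int) -> int:
    """Tokens for a square image reduced by `downsample` along each side."""
    side = resolution // downsample
    return side * side


# Hypothetical settings, chosen only to illustrate the scaling.
base = num_tokens(224, 16)      # small image, fixed 16x patching
naive = num_tokens(1024, 16)    # bigger image, same 16x patching
learned = num_tokens(1024, 64)  # bigger image, more aggressive (assumed 64x) downsampler

print(base, naive, learned)     # 196 4096 256
```

Self-attention cost grows roughly with the square of the token count, so the middle case is where inference speed would tank. The last case sees the same high-resolution pixels but hands the language model about as many tokens as the small image did.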

Note that their model is smaller. That actually gives us good opportunities for analysis, since it suggests what I'm implying: a more efficient embedding.

  > Three CNN layers with two transformer layers is just good product engineering. The real insight to be had here is that writing your own custom downsampling algorithm is a waste of time. You should make the downsampling learnable and part of the model.

Actually, the reason I linked [3] is that it reminded me of that paper. They used an overlapping (convolutional) patch-and-embed method in the ViT model, as opposed to the standard hard partitioning. In effect, that is the same conclusion: learn your downsampler (embedder).
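For anyone unfamiliar with the distinction, here is a rough 1-D sketch of hard partitioning versus overlapping patching. It only lists which input positions each token's window covers; the kernel, stride, and padding values are assumptions I picked for illustration, not the settings from [3], and `patch_windows` is a hypothetical helper.

```python
def patch_windows(length, kernel, stride, padding=0):
    """Return the (start, end) input range each output token sees, clipped to the input."""
    n_out = (length + 2 * padding - kernel) // stride + 1
    windows = []
    for i in range(n_out):
        start = i * stride - padding
        windows.append((max(start, 0), min(start + kernel, length)))
    return windows


# Hard partitioning: kernel == stride, so windows tile the input with no overlap.
hard = patch_windows(16, kernel=4, stride=4)

# Overlapping patching: kernel > stride, so neighbouring tokens share pixels.
soft = patch_windows(16, kernel=7, stride=4, padding=3)

print(hard)  # [(0, 4), (4, 8), (8, 12), (12, 16)]
print(soft)  # [(0, 4), (1, 8), (5, 12), (9, 16)]
```

Both produce four tokens from sixteen positions, so the sequence length is the same. The difference is that in the overlapping case, information at a patch boundary lands in two tokens instead of being split by an arbitrary grid line.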

I think we're pretty much in agreement. I just really want to see more ablations.