
Sure, that's a reasonable idea, but it isn't as easy as "la de da, add augmentation". I see this effect in models that already use noise injection, btw. I should also add that certain resizing operations have caused my model to collapse rather than fixing the problem. It is not uncommon to see additional augmentation __HURT__ models rather than help them.

But I think what you're missing is that these effects are not well known. How many works do you see use PIL-based resizing? How many use noise injection? Any augmentation? Are people considering that CUDA FMA is different from CPU FMA? That you get different results on different chipsets? I have no idea about TPUs, but I bet you there are differences. The rabbit hole I'm discussing goes beyond the resizing operation; these effects exist everywhere.
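To make the FMA point concrete, here is a minimal sketch of why a fused multiply-add can disagree with a separate multiply and add. It does not call any GPU or CPU FMA instruction; the `fused_multiply_add` helper is a hypothetical stand-in that emulates the "round once" behavior with exact rational arithmetic, and the inputs are picked by hand to expose the difference.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Emulate an FMA: compute a*b + c exactly, then round to a float once.

    Hypothetical helper for illustration only, not a hardware FMA.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Inputs chosen so the exact product 1 - 2**-60 is not representable as a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = a * b + c                  # product is rounded to 1.0, then c is added
fused = fused_multiply_add(a, b, c)  # rounded only once, at the very end

print(unfused)  # 0.0
print(fused)    # -2**-60, a tiny but nonzero value
print(unfused == fused)  # False
```

One rounding step is the whole difference here, and nothing in the math above it changed. Whether a given framework or chip fuses a particular operation is something you would have to check for your own setup.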

What happens is that different works end up with different measurements and it becomes non-trivial to compare them. Remember, in the paper you only see benchmarks; you don't see the code and you just have to trust. But I can tell you that this particular mistake is the norm rather than the exception, and not in a consistent way. So say you write a generative model in TensorFlow and the same one in PyTorch: you get different results when measuring FID even if the model weights are identical. It's rather reasonable for a researcher/user to believe that the underlying libraries are implementing things in the best way. That's the point of these libraries in the first place. It's also not unreasonable for users to think that these effects constitute noise and that an iterative converging algorithm will find a signal through that noise. Remember, these are effects that can cause changes in leaderboard positions and thus cause your work to get rejected. It's the same work and has the same value, but if you only look at leaderboards you aren't comparing properly. It wouldn't have been a big deal 4-5 years ago when our models were much worse, but now that we're pushing the limits of any metric, these types of effects can dominate.
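As a toy illustration of how two "equivalent" resizes can feed a metric very different pixels, here is a sketch in plain Python on a 1D signal. It is not what PIL, TensorFlow, or PyTorch actually do internally; `downsample_nearest` and `downsample_box` are hypothetical helpers standing in for a resize without antialiasing and one with a crude averaging filter.

```python
def downsample_nearest(signal, factor):
    """Keep every `factor`-th sample with no low-pass filtering (aliasing-prone)."""
    return [signal[i] for i in range(0, len(signal), factor)]

def downsample_box(signal, factor):
    """Average each block of `factor` samples (a crude antialiasing filter)."""
    return [
        sum(signal[i:i + factor]) / factor
        for i in range(0, len(signal) - factor + 1, factor)
    ]

# A high-frequency stripe pattern: 8 samples alternating black and white.
stripes = [0, 255] * 4

naive = downsample_nearest(stripes, 2)
filtered = downsample_box(stripes, 2)

print(naive)     # [0, 0, 0, 0]
print(filtered)  # [127.5, 127.5, 127.5, 127.5]
```

Same input, same target size, and one path says the image is black while the other says it is mid-gray. Any statistic computed downstream of the resize inherits that disagreement.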

The point is that there are extremely subtle effects going on that any reasonable person would not assume are happening. I'm willing to bet that you didn't know these resizing differences existed until my previous comment. Brushing it off with the obviousness of hindsight is a terrible way to measure a priori understanding. It is the whole problem of not knowing what you don't know. We live in a specialized world; it is not expected that an ML researcher knows all the nuances of a specific library, CUDA versions, compiler versions, and so on. Similarly, it's not expected for programming language people to even know ML. You'd have to do like 5 PhDs before getting anywhere!

The tl;dr is just to be cautious and take to heart that evaluation is hard and nefarious. A callous approach to evaluation, assuming everything is "obvious" or "simple," will quickly lead you to make errors.