| > Happy to answer questions. Do you expect instability between successive macrobatch gradients? That is, why are you comparing microgradients within a single batch, which adds a whole bunch of serialization headaches, rather than comparing with the macrogradient of the previous step? Given your test setup of noisy labels, isn't the sequential accumulation of microgradients dangerous? Suppose we take this to the limit of a minibatch size of 1, and the first microgradient happens to come from an example with an incorrect label. If I understand this correctly, gradient filtering would then add all the gradients that are consistent with the bad example and reject the gradients that belong to good examples. The text only contemplates the perverse case of rejecting all examples from a microbatch, not that of accidentally accepting only corrupted gradients. The filtered gradients are used via SGD with momentum (although equation (6) looks like momentum-free SGD). Have you seen, or do you expect, different results when supplying the filtered gradients to Adam/AdamW or other, more sophisticated optimizers? Your thresholding is binary, either accepting or rejecting an entire microgradient. Have you tested soft thresholding? Is there an information-theoretic way to explain this effect? In Figure 7, why does GAF with a large threshold result in a lower validation accuracy than the baseline? In GAF terms, the baseline accepts every microgradient, so I'd expect GAF to converge to the baseline result as the threshold increases. What does the equivalent of Figure 7 look like at 0% error? |
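To make the concern about sequential accumulation concrete, here is a minimal toy sketch. It is not the paper's algorithm: the function `sequential_filter`, the acceptance rule (cosine similarity with the running mean at or above a hypothetical `min_cosine`), and the two-dimensional "gradients" are all assumptions made purely for illustration. It only shows how the choice of the first microgradient can decide which later ones are accepted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sequential_filter(microgradients, min_cosine):
    """Hypothetical sequential filter (an assumption, not the paper's rule).

    The first microgradient seeds the running mean. Each later one is
    accepted only if its cosine similarity with the running mean is at
    least `min_cosine`; accepted ones are averaged into the running mean.
    Returns the final running mean and the indices that were accepted.
    """
    running = list(microgradients[0])
    accepted = [0]
    for i, g in enumerate(microgradients[1:], start=1):
        if cosine(running, g) >= min_cosine:
            n = len(accepted)
            running = [(r * n + x) / (n + 1) for r, x in zip(running, g)]
            accepted.append(i)
    return running, accepted


# Toy data: "clean" gradients point roughly along +x,
# "corrupted" ones (mislabelled examples) point roughly along -x.
clean = [[1.0, 0.1], [0.9, -0.1], [1.1, 0.0]]
corrupted = [[-1.0, 0.0], [-0.9, 0.1]]

# Case 1: the seed is a corrupted gradient.
bad_first = [corrupted[0]] + clean + [corrupted[1]]
_, accepted_bad_seed = sequential_filter(bad_first, min_cosine=0.0)

# Case 2: the seed is a clean gradient.
good_first = [clean[0], corrupted[0]] + clean[1:] + [corrupted[1]]
_, accepted_good_seed = sequential_filter(good_first, min_cosine=0.0)

# Case 3: a filter loose enough to accept everything.
_, accepted_all = sequential_filter(bad_first, min_cosine=-1.0)

print(accepted_bad_seed)   # indices kept when the seed is corrupted
print(accepted_good_seed)  # indices kept when the seed is clean
print(accepted_all)        # indices kept with the loosest setting
```

In this toy setting, a corrupted seed leads the filter to keep only the other corrupted gradient, while a clean seed keeps only the clean ones. With the loosest setting every microgradient is accepted, which is the sense in which I would expect the method to reduce to the baseline.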