by szvsw 535 days ago
Just skimmed the abstract and it immediately has me wondering if there are ways that this research intersects with explainability research - specifically, the notion that certain inputs only activate certain portions of a network. I wonder if at some point that kind of information could be leveraged to provide better sorting of datasets into batches. Obviously this conflicts a little bit with the notion that you want your batches to be just completely random permutations.

In some sense, if two inputs activate the same parts of the network at high intensity, then they are more likely to result in conflicting gradients if they are in separate microbatches from the same macrobatch.
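
To make "conflicting gradients" concrete, here is a minimal sketch, not anything taken from the paper. It assumes a toy linear model with squared loss, and the helper names `per_sample_grad` and `cosine` are made up for illustration. It compares per-sample gradients by cosine similarity: a value near -1 means two samples want opposite updates to the same weights, and a value near 0 means they touch different weights.

```python
import math

def per_sample_grad(w, x, t):
    """Gradient of 0.5 * (w.x - t)^2 with respect to w (toy linear model, assumed)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - t
    return [err * xi for xi in x]

def cosine(a, b):
    """Cosine similarity of two gradient vectors; 0.0 if either one is all zeros."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

w = [0.0, 0.0, 0.0]

# Samples a and b activate the same weight but have opposite targets.
g_a = per_sample_grad(w, [1.0, 0.0, 0.0], 1.0)
g_b = per_sample_grad(w, [1.0, 0.0, 0.0], -1.0)
# Sample c activates a different weight entirely.
g_c = per_sample_grad(w, [0.0, 1.0, 0.0], 1.0)

conflict_ab = cosine(g_a, g_b)  # opposite directions on a shared weight
conflict_ac = cosine(g_a, g_c)  # disjoint weights, so no interaction
```

In a real network you would need per-sample gradients (or some cheaper proxy such as activation patterns), which is exactly the overhead question raised further down.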

I’m curious if you think long term, there could be some utility in trying to periodically extract information about how samples activate the network and use that (or some other representation) to better sort the micro/macro batches to maximize your parallelism by making conflicts less likely.

Obviously there would be some sort of time penalty dealing with calculating whatever that representation is and sorting so that overhead might far outweigh any gains, but I could see it as at least plausible that there might be some contexts/scales where making that periodic investment could pay off.

1 comment

Yes! I think this is a great area of research. If you think of the gradient values as a blame score for why you got the answer wrong, then you can have a lot of fun exploring which weights light up for different problems. One note: in Ring All Reduce they never actually share the FULL gradient, only blocks of it. So to put this into practice you’d have to show that you can do the thresholding on a block of gradients rather than on the full gradient, which you may never be able to fit in VRAM. Will the results still hold? I don’t know. I believe they would, but that’s for the next paper.
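
A rough sketch of the block-versus-full question, under loud assumptions: the thread does not say which thresholding rule is used, so this assumes a "keep the top fraction by magnitude" rule, and `topk_mask`, `blockwise_mask`, and `agreement` are hypothetical helpers. It shows that a relative threshold applied per block can select different entries than the same threshold applied to the whole gradient.

```python
import math

def topk_mask(values, keep_fraction):
    """Mark the largest-magnitude entries, keeping about keep_fraction of them."""
    k = max(1, round(len(values) * keep_fraction))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(values))]

def blockwise_mask(values, num_blocks, keep_fraction):
    """Apply topk_mask independently to each contiguous block (assumed block layout)."""
    size = math.ceil(len(values) / num_blocks)
    mask = []
    for start in range(0, len(values), size):
        mask.extend(topk_mask(values[start:start + size], keep_fraction))
    return mask

def agreement(a, b):
    """Fraction of positions where the two masks make the same keep/drop decision."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy gradient: the first half is large, the second half is tiny.
grad = [0.9, -0.8, 0.7, 0.6, 0.01, -0.02, 0.03, 0.04]

full = topk_mask(grad, 0.5)            # sees the whole gradient at once
blocks = blockwise_mask(grad, 2, 0.5)  # sees only one block at a time
overlap = agreement(full, blocks)
```

With a fixed absolute cutoff the two masks would match trivially; the open question only bites when the threshold depends on the distribution of values, as in this toy case where the per-block rule keeps tiny entries from the second block.
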
Very cool! Glad to hear my intuition is on the right track… I’m very much on the applied ML for engineering design side as opposed to the bleeding edge research side, so in terms of multi-node training I haven’t done much more than spin up a few GPUs and let PyTorch Lightning handle the parallelism, but cool to try to keep up with this stuff.

Thanks for the response and good luck with this!