by szvsw
535 days ago
Just skimmed the abstract, and it immediately has me wondering whether this research intersects with explainability research, specifically the notion that a given input only activates certain portions of a network. I wonder if that kind of information could eventually be leveraged to sort datasets into batches more effectively. Obviously this conflicts a little with the usual goal of making batches completely random permutations. In some sense, if two inputs activate the same parts of the network at high intensity, then they are more likely to produce conflicting gradients when they sit in separate microbatches of the same macrobatch. I’m curious whether you think, long term, there could be some utility in periodically extracting information about how samples activate the network and using that (or some other representation) to sort the micro/macro batches, so that conflicts become less likely and parallelism is maximized. Obviously there would be a time penalty for computing that representation and doing the sort, and that overhead might far outweigh any gains, but it seems at least plausible that there are some contexts/scales where making that periodic investment could pay off.
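To make the idea a bit more concrete, here is a rough sketch of what I have in mind. It is purely illustrative and not taken from the paper: the function names (`activation_signatures`, `group_into_microbatches`), the `top_k` cutoff, and the greedy grouping rule are all my own assumptions. It assumes you have already recorded one activation vector per sample somehow. It then marks each sample's most active units and greedily puts samples with heavily overlapping signatures into the same microbatch, so they don't end up in separate microbatches of one macrobatch.

```python
import numpy as np

def activation_signatures(activations, top_k):
    """Boolean mask marking each sample's top_k most active units.

    `activations` is a hypothetical (n_samples, n_units) array of
    recorded activation magnitudes; how it is collected is left open.
    """
    n, d = activations.shape
    # Indices of the top_k largest activations in each row (unordered).
    idx = np.argpartition(-activations, top_k - 1, axis=1)[:, :top_k]
    mask = np.zeros((n, d), dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    return mask

def group_into_microbatches(mask, microbatch_size):
    """Greedily co-locate samples whose signatures overlap the most.

    This is an assumed heuristic, not an established algorithm: pick the
    lowest-index unassigned sample as a seed, then fill its microbatch
    with the unassigned samples sharing the most active units with it.
    """
    m = mask.astype(np.int32)
    overlap = m @ m.T  # overlap[i, j] = number of shared active units
    unassigned = set(range(mask.shape[0]))
    batches = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        ranked = sorted(unassigned, key=lambda j: (-overlap[seed, j], j))
        members = [seed] + ranked[:microbatch_size - 1]
        unassigned.difference_update(members)
        batches.append(members)
    return batches

# Toy data: samples 0 and 2 light up units 0-1, samples 1 and 3 units 2-3.
acts = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
])
signatures = activation_signatures(acts, top_k=2)
microbatches = group_into_microbatches(signatures, microbatch_size=2)
```

The pairwise overlap matrix is quadratic in the number of samples, which is exactly the kind of overhead I was worried about, so at real scale you would presumably need something cheaper (hashing or clustering the signatures, say) and would only recompute it occasionally.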