by fundamental 1614 days ago
The patch-based routing to experts described here may not be a problem in practice, but at first glance it seems to discard more spatial information than you'd like, and to introduce more image boundaries than would be ideal. You could argue that the former is a known issue with many DNN architectures, but if the intent is to enable larger-scale generalization, the paper may be trading away more of the information in the source material for speed than is desirable. AFAIK the shuffling would be less of an issue in textual models than in image processing tasks. As for the boundaries, I guess padding could be in play, though I suspect the resulting network will be more sensitive to shifts up/down or left/right by a few pixels.
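
To make the shift-sensitivity worry concrete, here is a toy sketch in plain Python. Everything in it is my own assumption rather than anything from the paper: the non-overlapping patch grid, the zero padding, the random "gate" vectors standing in for a learned router, and all the names (`patchify`, `route`, `shift_right`, `routing_change_rate`). It only measures how often a hard top-1 router sends the patch at a given grid position to a different expert after the image is shifted by a few pixels.

```python
import random


def patchify(image, patch):
    """Split a 2D list into non-overlapping patch x patch tiles (row-major, flattened)."""
    h, w = len(image), len(image[0])
    tiles = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tiles.append([image[top + r][left + c]
                          for r in range(patch) for c in range(patch)])
    return tiles


def route(tile, gates):
    """Hard top-1 routing: index of the gate vector with the largest dot product."""
    scores = [sum(g * x for g, x in zip(gate, tile)) for gate in gates]
    return scores.index(max(scores))


def shift_right(image, pixels):
    """Shift every row right by `pixels`, zero-padding on the left."""
    return [[0.0] * pixels + row[:len(row) - pixels] for row in image]


def routing_change_rate(image, gates, patch, pixels):
    """Fraction of patch positions whose chosen expert changes after the shift."""
    before = [route(t, gates) for t in patchify(image, patch)]
    after = [route(t, gates) for t in patchify(shift_right(image, pixels), patch)]
    changed = sum(b != a for b, a in zip(before, after))
    return changed / len(before)


# Hypothetical setup: random image, random (untrained) gates, fixed seed.
rng = random.Random(0)
SIZE, PATCH, EXPERTS = 32, 8, 4
image = [[rng.random() for _ in range(SIZE)] for _ in range(SIZE)]
gates = [[rng.gauss(0, 1) for _ in range(PATCH * PATCH)] for _ in range(EXPERTS)]

for pixels in (0, 1, 2, 4):
    print(pixels, routing_change_rate(image, gates, PATCH, pixels))
```

With random gates and a random image this says nothing about a trained network, which may well learn routing that is more stable. It is just a way to poke at the question: swap in real images and a real router and see whether the change rate stays high for small shifts.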

Even with those issues, I'd imagine there could be some nice benefits, and the authors are right (IMO) to lean on conditional execution and routing, since that lets the network specialize on a given subdomain while staying computationally efficient. We'll have to see where subsequent work takes this approach.