Hacker News new | ask | show | jobs
by sdenton4 1163 days ago
That's the point of having multiple realizations of the same underlying model.

The (depthwise) convolutional realization is extremely efficient for training, and the RNN is extremely efficient for inference. The scaling in both of these cases is much better than attention layers - as they discuss in the article.