Hacker News
by MiroF 2402 days ago
Perhaps I am missing the point of this article. The RNN approach seems to get similar performance, but uses more parameters and misses the parallelization benefits that Transformers have and recurrent networks do not.

What is the benefit of the RNN here?

1 comment

The parallelism in a transformer doesn't necessarily translate to less or faster compute. Each layer still has to be computed serially after the previous one, and the computation of each attention head is quadratic in the length of the input sequence. When used this way for language modeling, the transformer also has to be run step by step at inference, so the parallelism that was a boon during training is no longer available.

The author doesn't do much absolute wall-clock time comparison, but does mention that only the adaptive transformer configuration trained in similar time on the single GPU.
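
To make the quadratic-cost and step-by-step points concrete, here is a rough pure-Python sketch of a single causal attention head. It is an illustration only, not the article's model: the function names are made up, and each input vector doubles as query, key, and value (no learned projections) to keep it short.

```python
import math


def attention_row(query, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def causal_attention(xs):
    """Training-style pass over a full sequence that is known up front.

    Each row depends only on the inputs, not on other rows, so the rows
    could be computed in parallel. The number of query-key scores is
    still n * (n + 1) / 2, i.e. quadratic in the sequence length n.
    """
    outputs, score_count = [], 0
    for t in range(len(xs)):
        prefix = xs[: t + 1]  # position t sees itself and earlier positions
        outputs.append(attention_row(xs[t], prefix, prefix))
        score_count += len(prefix)
    return outputs, score_count


def stepwise_decode(xs):
    """Inference-style pass: tokens arrive one at a time.

    Assumption for illustration: xs is fixed here. In real generation the
    next input is produced from the previous output, which is what forces
    the steps to run serially.
    """
    seen, outputs = [], []
    for x in xs:
        seen.append(x)
        outputs.append(attention_row(x, seen, seen))
    return outputs
```

Both functions produce the same outputs for the same inputs. The difference is that `causal_attention` can batch all rows when the whole sequence is available (training), while `stepwise_decode` has to wait for each token before computing the next row (generation). Doubling the sequence length roughly quadruples `score_count`.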