by MiroF
2402 days ago
Perhaps I am missing the point of this article. The RNN approach seems to reach similar performance, but it uses more parameters and gives up the parallelization benefits that Transformers have and recurrent networks lack. What is the benefit of the RNN here?

The author doesn't offer much of an absolute wall-clock time comparison, but does mention that only the adaptive Transformer configuration trained in a similar amount of time on the single GPU.