
The parallelism in a transformer doesn't necessarily translate to less or faster compute. Each layer still has to be computed serially, after the previous layer, and the computation in each attention head is quadratic in the length of the input sequence. When used this way for language modeling, the transformer also has to be run step by step at inference time, so the parallelism that was a boon during training is no longer available.
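To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It only counts things; it is not a benchmark and not the author's code. The function names and the example sizes (12 layers, 100 generated tokens) are made up for illustration, and it ignores tricks like caching keys and values, which cut the work per step but don't remove the token-by-token dependency.

```python
# Illustrative counting model only: hypothetical names, no real library API.

def attention_scores_per_head(seq_len: int) -> int:
    """Pairwise scores one attention head computes over a full sequence.

    Every position attends to every position, so the count grows
    with the square of the sequence length.
    """
    return seq_len * seq_len


def serial_layer_steps_training(num_layers: int) -> int:
    """Dependent layer evaluations for one training pass.

    All positions of a layer are computed together, so only the
    layer-after-layer dependency remains.
    """
    return num_layers


def serial_layer_steps_generation(num_layers: int, num_new_tokens: int) -> int:
    """Dependent layer evaluations when generating token by token.

    Each new token needs the previous token to exist first, so the
    whole stack is run once per generated token.
    """
    return num_layers * num_new_tokens


if __name__ == "__main__":
    for n in (128, 256, 512):
        print(f"seq_len={n}: {attention_scores_per_head(n)} scores per head")
    print("training:", serial_layer_steps_training(12), "serial layer steps")
    print("generation:", serial_layer_steps_generation(12, 100), "serial layer steps")
```

Doubling the sequence length quadruples the per-head score count, and the serial depth goes from the number of layers at training time to layers times generated tokens at inference time.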

The author doesn't do much absolute wall-time comparison, but does mention that only the adaptive transformer configuration trained in a similar amount of time on the single GPU.