| HN Mirror

I think it's because most of the compute comes from the decoding, since you're doing it autoregressively, while the encoder you just feed it through once and get the embedding. So really all it's saying is that the decoder, with N parameters, is the compute bottleneck; hence encoder-decoder with N+N has similar order compute cost as decoder with N.