You can break your sequence into two parts. One part goes through the encoder and the other goes through the decoder, so each token only goes through one transformer stack.
I think it's because most of the compute comes from the decoding, since you're doing it autoregressively, while the encoder you just feed it through once and get the embedding. So really all it's saying is that the decoder, with N parameters, is the compute bottleneck; hence encoder-decoder with N+N has similar order compute cost as decoder with N.