Yes, that's pretty accurate. Step 3 (attention) is repeated multiple times, i.e. once for each layer in the decoder. With each additional layer, you incorporate more of the previously translated text as well as information about which parts of the source sentence representation were used to generate it. The independence of the current word from the previous words applies to the training phase, since a complete reference translation is provided and the model is trained to predict only the single next word at each position. This kind of computation would be very inefficient with an RNN: it would have to run over each word in every layer sequentially, which prohibits efficient batching.
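To make the training setup concrete, here is a minimal sketch of how a reference translation can be turned into independent next-word prediction examples. The function name `teacher_forcing_pairs` and the `<s>` / `</s>` markers are made up for illustration and are not taken from the model's actual code. The point is only that every prefix comes from the reference rather than from the model's own earlier output, so all positions are known up front and can be processed together.

```python
from typing import List, Tuple


def teacher_forcing_pairs(
    reference: List[str], bos: str = "<s>", eos: str = "</s>"
) -> List[Tuple[List[str], str]]:
    """Build one (prefix, next word) example per target position.

    Hypothetical helper: the prefixes are read off the reference
    translation, so no example depends on a previous prediction.
    """
    inputs = [bos] + reference   # what the decoder sees
    targets = reference + [eos]  # what it is trained to predict
    return [(inputs[: i + 1], targets[i]) for i in range(len(targets))]


pairs = teacher_forcing_pairs(["a", "cat"])
for prefix, target in pairs:
    print(prefix, "->", target)
```

Since none of these examples waits on another, a non-recurrent decoder can score them all in one batched pass, whereas an RNN would still have to step through the prefix word by word.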

When generating a translation for a new sentence, the model uses classic beam search where the decoder is evaluated on a word-by-word basis. It's still pretty fast since the source-side network is highly parallelizable and running the decoder for a single word is relatively cheap.
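The word-by-word decoding can be sketched with a toy beam search. This is an illustration only, not the actual implementation: `step_fn` stands in for a single decoder evaluation and is assumed to return log-probabilities for the next word given the words so far, and the tiny probability table is invented.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens so far, summed log-probability)


def beam_search(
    step_fn: Callable[[List[str]], Dict[str, float]],
    bos: str,
    eos: str,
    beam_size: int = 3,
    max_len: int = 10,
) -> Hypothesis:
    """Keep the `beam_size` best partial translations at each step."""
    beams: List[Hypothesis] = [([bos], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by one word (one decoder call each).
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for word, logp in step_fn(tokens).items():
                candidates.append((tokens + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top candidates; those ending in `eos` are complete.
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached `eos`.
    return max(finished or beams, key=lambda c: c[1])


# Invented toy "decoder": next-word probabilities depend on the last word only.
TABLE = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def toy_step(tokens: List[str]) -> Dict[str, float]:
    return {w: math.log(p) for w, p in TABLE[tokens[-1]].items()}


best_tokens, best_score = beam_search(toy_step, "<s>", "</s>", beam_size=2)
print(best_tokens, math.exp(best_score))
```

With a beam of 2 the search keeps "a" alive after the first step and ends up with "a cat" (probability 0.36), while a beam of 1 commits to "the" and can reach at most 0.30. Each step only needs the decoder evaluated for one new word per hypothesis, which is the cheap part described above.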