Hacker News new | ask | show | jobs
by CGamesPlay 3329 days ago
I'm relatively novice to machine learning but here's my best attempt to summarize what's going on in layman's terms. Please correct me if I'm wrong.

- Encode the words in the source (aka embedding, section 3.1)

- Feed every run of k words into a convolutional layer producing an output, repeat this process 6 layers deep (section 3.2).

- Decide on which input word is most important for the "current" output word (aka attention, section 3.3).

- The most important word is decoded into the target language (section 3.1 again).

You repeat this process with every word as the "current" word. The critical insight of using this mechanism over an RNN is that you can do this repetition in parallel because each "current" word does not depend on any of the previous ones.

Am I on the right track?

1 comments

Yes, that's pretty accurate. Step 3 (attention) is repeated multiple times, i.e. for each layer in the decoder. With each additional layer, you incorporate more of the previously translated text as well as information about which parts of the source sentence representation were used to generate it. The independence of the current word from the previous words applies to the training phrase as a complete reference translation is provided and the model is trained to predict single next words only. This kind of computation would be very inefficient with an RNN: it would have to run over each word in every layer sequentially which prohibits efficient batching.

When generating a translation for a new sentence, the model uses classic beam search where the decoder is evaluated on a word-by-word basis. It's still pretty fast since the source-side network is highly parallelizable and running the decoder for a single word is relatively cheap.