What exactly do you mean by "they let the models train their own word embeddings"? Can you elaborate on this, or are there any current papers on the topic?
The embedding layer is the layer that converts the one-hot word feature into a continuous multi-dimensional vector that the deep net can learn with.
They used to pretrain that layer separately with word2vec. Now, since it's just a neural net layer, they let the translation model train it with backprop on the main task (translation, dialog, QA, etc.) as a regular layer.
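To make that concrete, here is a rough pure-Python sketch of the idea, not how any particular translation system does it. The vocabulary, the embedding size, the fixed weights `w`, and the scalar regression "task" are all made up for illustration. It shows two things: multiplying a one-hot vector by the embedding matrix is the same as looking up one row, and a backprop step on the main task's loss updates that row like any other weight.

```python
import random

random.seed(0)

vocab = ["the", "cat", "sat"]   # toy vocabulary (hypothetical)
dim = 4                         # embedding size (hypothetical)

# The embedding layer is just a vocab_size x dim weight matrix,
# randomly initialised instead of pretrained with word2vec.
E = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]


def one_hot(index, size):
    """One-hot feature for the word at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed_matmul(index):
    """One-hot vector times E, written out as an explicit matrix product."""
    x = one_hot(index, len(E))
    return [sum(x[i] * E[i][j] for i in range(len(E))) for j in range(dim)]


def embed_lookup(index):
    """The same result computed as a plain row lookup."""
    return list(E[index])


# Stand-in for the rest of the network: fixed weights mapping the
# embedding to a scalar prediction. In a real model these are trained too.
w = [0.5, -0.3, 0.8, 0.1]


def sgd_step(index, target, lr=0.1):
    """One gradient step on squared error; only row E[index] is updated."""
    v = embed_lookup(index)
    pred = sum(wj * vj for wj, vj in zip(w, v))
    err = pred - target
    # For loss = err**2, d(loss)/d(E[index][j]) = 2 * err * w[j].
    for j in range(dim):
        E[index][j] -= lr * 2 * err * w[j]
    return err ** 2
```

Calling `sgd_step(1, 1.0)` repeatedly moves the vector for "cat" so the toy loss shrinks, while the rows for words that did not appear in the example stay untouched. In a real framework the same thing happens automatically, because the embedding matrix is registered as a trainable parameter.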