|
|
|
|
|
by jgehring
3327 days ago
|
|
Yes, there have been a couple of attempts to use CNNs for translation already, but none of them outperformed big and well-tuned LSTM systems. We propose an architecture that is fast to run, easy to optimize and can scale to big networks, and could thus be used as a base architecture for future research. There are a couple of contributions in the paper (https://arxiv.org/abs/1705.03122) apart from demonstrating the feasibility of CNNs for translation, e.g. the multi-hop attention in combination with a CNN language model, the wiring of the CNN encoder[1], or an initialization scheme for GLUs that, when combined with appropriate scaling for residual connections, enables the training of very deep networks without batch normalization. [1] In previous work (https://arxiv.org/abs/1611.02344), we required two CNNs in the encoder: one for the keys (dot products) and one for the values (decoder input). |
|
It is true that QRNN had results on mostly small-scale benchmarks, but it seemed that Bytenet especially the second version had SOTA results both for language models with characters and for machine translation with characters on the same large-scale En-De WMT task that is used in this paper.
MT with characters, with regards to ordering, structure, etc, is potentially much harder than with words or word-pieces, since the encoded sequences are 5 or 6 times longer on average, and the meanings of words need to be built up from individual characters.