Hacker News
by jgehring 3327 days ago
Yes, there have been a couple of attempts to use CNNs for translation already, but none of them outperformed big and well-tuned LSTM systems. We propose an architecture that is fast to run, easy to optimize and can scale to big networks, and could thus be used as a base architecture for future research.

There are a couple of contributions in the paper (https://arxiv.org/abs/1705.03122) apart from demonstrating the feasibility of CNNs for translation, e.g. the multi-hop attention in combination with a CNN language model, the wiring of the CNN encoder[1], or an initialization scheme for GLUs that, when combined with appropriate scaling for residual connections, enables the training of very deep networks without batch normalization.
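
For readers unfamiliar with GLUs and scaled residual connections, here is a minimal plain-Python sketch of the idea. The function names are hypothetical, and the sqrt(0.5) factor is an assumption about what "appropriate scaling" means (it keeps the variance of the sum roughly constant when both terms have similar variance). The initialization scheme itself is not shown; see the paper for the details.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu(a, b):
    """Gated linear unit: elementwise a * sigmoid(b)."""
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]

def residual_glu(x, a, b, scale=math.sqrt(0.5)):
    """Add the block input x to the GLU output, then rescale the sum.

    The default scale of sqrt(0.5) is an assumption made for this sketch.
    """
    return [(xi + gi) * scale for xi, gi in zip(x, glu(a, b))]

# Toy vectors; in a real layer a and b would be the two halves
# of a convolution's output over the input x.
x = [1.0, -2.0, 0.5]
a = [0.3, 0.1, -0.4]
b = [0.0, 2.0, -1.0]
y = residual_glu(x, a, b)
```

Stacking many such blocks without the rescaling would let the activation variance grow with depth, which is presumably why the scaling matters for very deep networks.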

[1] In previous work (https://arxiv.org/abs/1611.02344), we required two CNNs in the encoder: one for the keys (dot products) and one for the values (decoder input).
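
To illustrate the key/value distinction in the footnote: a rough plain-Python sketch of dot-product attention where the keys are only used to compute the weights and the values are what gets passed on to the decoder. The names `softmax` and `attend` are hypothetical, and the plain lists stand in for encoder outputs. This is not the code from either paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for a single decoder state.

    keys   : one vector per source position, used only for the dot products
    values : one vector per source position, averaged into the context
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Two source positions with identical keys, so the weights are uniform.
weights, context = attend([1.0, 0.0],
                          keys=[[1.0, 0.0], [1.0, 0.0]],
                          values=[[2.0, 0.0], [0.0, 2.0]])
```

In the two-CNN setup, `keys` and `values` would come from two separate encoder networks. With a single encoder, both have to be derived from the same stack.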

1 comment

> there have been a couple of attempts to use CNNs for translation already, but none of them outperformed big and well-tuned LSTM systems

It is true that QRNN reported results mostly on small-scale benchmarks, but ByteNet, especially the second version, seems to have had SOTA results both for character-level language modeling and for character-level machine translation on the same large-scale En-De WMT task that is used in this paper.

MT with characters is potentially much harder than with words or word pieces with regard to ordering, structure, etc., since the encoded sequences are 5 or 6 times longer on average and the meanings of words need to be built up from individual characters.

Yes, ByteNet v2 outperforms LSTMs on characters but not on word pieces. It would be interesting to see how our model performs on characters, especially when scaled up to the size of ByteNet (30+30 layers), and also how ByteNet performs on BPE codes. I think that character-level NMT is definitely interesting and worth investigating, but from a practical point of view it makes sense to choose the representation that achieves the best translation accuracy and speed.