Hacker News new | ask | show | jobs
by snippyhollow 3321 days ago
paper: https://s3.amazonaws.com/fairseq/papers/convolutional-sequen...

code: https://github.com/facebookresearch/fairseq

pre-trained models: https://github.com/facebookresearch/fairseq#evaluating-pre-t...

1 comments

One logical continuation of adding more attention steps is to make decision of how many attention steps to take determined by the network ala "Adaptive Computation Time for Recurrent Neural Networks", are you planning to go in that direction?
One of my students tried something along these lines for Natural Language Inference (NLI) last year. [1] The results where not conclusive, but perhaps Machine Translation is a better target? My reason for believing this is that the specific dataset for NLI most likely does not require multiple steps of inference for most cases (you can get away with simple token overlap), while the decoder in MT does so since it is constrained to output a single token at each step.

[1]: https://arxiv.org/abs/1610.07647