by snyhlxde 768 days ago
Thanks for your interest in our work! Yes, we found that training with consistency loss + AR loss on even a subset of a dataset results in a significant speedup (0.01% of pre-training cost). Training on more data permits even further speedup: the model is able to learn from more frequently appearing collocations and phrases.
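
To make "consistency loss + AR loss" concrete, here is a rough sketch of how two such terms could be summed into one objective. It is not the actual training code: the comment only names the two losses, so the KL form of the consistency term, the `ar_weight` knob, and all function names are assumptions for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ar_loss(logits_per_pos, target_ids):
    # Standard autoregressive loss: mean cross-entropy against the target tokens.
    losses = [-math.log(softmax(l)[t]) for l, t in zip(logits_per_pos, target_ids)]
    return sum(losses) / len(losses)

def consistency_loss(student_logits, teacher_logits):
    # Assumed form: mean KL(teacher || student) per position, where the
    # "teacher" logits come from the converged output and are treated as fixed,
    # and the "student" logits come from an intermediate decoding state.
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        p, q = softmax(t), softmax(s)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_logits)

def combined_loss(student_logits, teacher_logits, ar_logits, target_ids, ar_weight=1.0):
    # Hypothetical weighting: consistency term plus a scaled AR term.
    return (consistency_loss(student_logits, teacher_logits)
            + ar_weight * ar_loss(ar_logits, target_ids))
```

The consistency term goes to zero when the intermediate state already predicts what the converged output predicts, while the AR term keeps the model anchored to ordinary next-token prediction.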

For more details, please check out our paper, where you can also see that the speedup saturates as the size of the training data grows.