Hacker News
by regularfry 968 days ago
It's not the inference, it's the training. They say in the paper: "We train with a batch size of 256 for a total of 80,000 optimisation steps, which amounts to eight epochs of training." That's a fair chunk of time. Mind you, `small.en` has smaller decoder layers than `medium.en`...
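For a sense of scale, here is the arithmetic those quoted figures imply. This is only a back-of-the-envelope sketch: it assumes every step uses a full batch of 256 and that the eight epochs are equal in size, and the variable names are mine, not the paper's. It counts training examples, which says nothing about wall-clock time.

```python
# Figures quoted from the paper.
batch_size = 256
steps = 80_000
epochs = 8

# Derived quantities (assumes full batches and equal-sized epochs).
samples_seen = batch_size * steps            # examples processed in total
samples_per_epoch = samples_seen // epochs   # implied training-set size
steps_per_epoch = steps // epochs

print(f"total examples processed: {samples_seen:,}")
print(f"implied examples per epoch: {samples_per_epoch:,}")
print(f"steps per epoch: {steps_per_epoch:,}")
```

So roughly 20 million examples go through the model overall, about 2.56 million per pass.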