| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by regularfry 968 days ago
	It's not the inference, it's the training. They say in the paper: "We train with a batch size of 256 for a total of 80,000 optimisation steps, which amounts to eight epochs of training." That's a fair chunk of time. Mind you, `small.en` has smaller decoder layers than `medium.en`...