| HN Mirror



	by solomatov 2683 days ago
	Here's the quote from the BERT paper:
	> We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
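
	The arithmetic in that quote can be sanity-checked. This is only a rough sketch: it assumes every sequence is the full 512 tokens and treats one token as roughly one word, neither of which the quote states. The function names are made up for illustration. Note that 256 * 512 is exactly 131,072, so the quoted 128,000 is presumably a rounded figure.

```python
# Back-of-the-envelope check of the numbers in the quoted passage.
SEQS_PER_BATCH = 256     # from the quote
TOKENS_PER_SEQ = 512     # from the quote
STEPS = 1_000_000        # from the quote
CORPUS_WORDS = 3.3e9     # from the quote


def tokens_per_batch(seqs=SEQS_PER_BATCH, tokens=TOKENS_PER_SEQ):
    """Exact token count per batch if every sequence is full length."""
    return seqs * tokens


def approx_epochs(steps=STEPS, corpus_size=CORPUS_WORDS):
    """Rough number of passes over the corpus.

    Assumption (not in the quote): one token ~ one word, and every
    sequence is padded or packed to the full 512 tokens.
    """
    return tokens_per_batch() * steps / corpus_size


print(tokens_per_batch())          # 131072
print(round(approx_epochs(), 1))   # 39.7
```

	Under those assumptions the result lands close to the "approximately 40 epochs" in the quote.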