Hacker News
by solomatov 2683 days ago
Here's the quote from the BERT paper:

>We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
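
For anyone who wants to check the arithmetic in that quote, here is a rough sketch. The variable names are mine, and it assumes one word is roughly one token, which the paper does not claim (WordPiece usually produces more tokens than words), so read the epoch count as a ballpark only.

```python
# Figures taken directly from the quoted sentence.
sequences_per_batch = 256
tokens_per_sequence = 512
steps = 1_000_000
corpus_words = 3_300_000_000

# 256 * 512 is exactly 131,072; the quote's "128,000" looks like a
# rounded "128k" figure (131,072 = 128 * 1024).
tokens_per_batch = sequences_per_batch * tokens_per_sequence

# Total tokens seen over the whole training run.
total_tokens = tokens_per_batch * steps

# Assumption (not from the paper): treat 1 word as 1 token to get a
# rough number of passes over the corpus.
approx_epochs = total_tokens / corpus_words

print(tokens_per_batch)
print(round(approx_epochs, 1))
```

Under that assumption the result comes out just under 40 passes, which is consistent with the "approximately 40 epochs" in the quote.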