| HN Mirror

Since the training is done on the tasks of masked word prediction and contiguous sentence prediction, I'd suggest about a million sentences (from the same domain), with an average token length of 7 per sentence. Longer sentences would definitely help, as BERT uses the transformer encoder architecture which has multi head attention. This would enable the model to do better contextual representation learning for the embedding layer.