Hacker News
by rsmith49 2504 days ago
For fine-tuning BERT on a specific domain, how much text data would you recommend training on?
1 comment

Since the training is done on two tasks, masked word prediction and next sentence prediction, I'd suggest about a million sentences from the same domain, averaging about 7 tokens per sentence. Longer sentences would definitely help: BERT uses the transformer encoder architecture, and its multi-head attention has more context to attend over when each sentence carries more tokens. That should let the model learn better contextual embeddings.
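To put rough numbers on that suggestion, here is a minimal sketch in plain Python. It assumes the masking recipe from the original BERT pretraining setup (select about 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged). The function names, the whitespace tokenization, and the toy sentence are hypothetical and only for illustration; a real run would use a WordPiece tokenizer and a library's own masking utility.

```python
import random

MASK_TOKEN = "[MASK]"


def estimate_corpus_tokens(num_sentences, avg_tokens_per_sentence):
    """Rough size of the training corpus in tokens."""
    return num_sentences * avg_tokens_per_sentence


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking to one tokenized sentence.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# One million sentences at about 7 tokens each.
print(estimate_corpus_tokens(1_000_000, 7))  # 7000000

# Hypothetical domain sentence, split on whitespace for simplicity.
sentence = "the patient was given ten mg daily".split()
masked, labels = mask_tokens(sentence, vocab=sentence, seed=0)
print(masked)
print(labels)
```

At 7 tokens per sentence and a 15% masking rate, each sentence yields about one prediction target on average (7 × 0.15 ≈ 1.05), which is one way to see why longer sentences give the masked word task more signal per example.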