by dwagnerkc
872 days ago
That post also helpfully links to a companion paper, published alongside the OLMo paper, that covers just the dataset: "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research" https://arxiv.org/abs/2402.00159