Hacker News
by dwagnerkc 872 days ago
That post also helpfully links to another paper, published alongside the OLMo paper, that focuses just on the dataset.

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

https://arxiv.org/abs/2402.00159