by dwagnerkc 872 days ago

That post also helpfully links to a second paper, published alongside the OLMo paper, that covers just the dataset:

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research