by Eisenstein
199 days ago
> How many models are only trained on legal[0] data?

None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which consists of:

1. Common Crawl
2. GitHub
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg

* https://arxiv.org/pdf/2402.00159
https://huggingface.co/datasets/allenai/dolma
https://huggingface.co/models?dataset=dataset:allenai/dolma