by evilduck 562 days ago
I think you're looking for OLMo, https://allenai.org/olmo

They appear to use Common Crawl in the DCLM dataset. Just downloading Common Crawl is probably copyright infringement, even before we consider the specific terms in the licenses. arXiv papers carry a mix of licenses, some of which don't allow commercial use.

If I got the sources right, it’s already illegal based on just two of the sources they scraped. That’s why I want a model trained on Gutenberg content that has no restrictions.