Hacker News new | ask | show | jobs
by nickpsecurity 565 days ago
They appear to use Common Crawl in the DCLM dataset. Just downloading Common Crawl is probably copyright infringement before we consider specific terms in the licenses. Arxiv papers have a mix of licenses with some not allowing commercial use.

If I got the sources right, it’s already illegal with just two sources they scraped. That’s why I want one on Gutenberg content that has no restrictions.