Hacker News
by simonw 4 hours ago
It uses FineWeb, which is derived from Common Crawl, itself an unlicensed scrape of web pages.