Hacker News
by ccgreg 53 days ago
I don't know of anyone who uses Common Crawl as pre-training data without filtering it. We have an annotation system that lets people pick and choose which subsets they'd like to use.