Hacker News
by
andy99
207 days ago
It says it’s Common Crawl. I interpret that to mean this is a generic web scrape dataset; presumably they filter out whatever they don’t want before pretraining. You’d have to do some ablation testing to know what value it adds.
1 comment
ccgreg
204 days ago
Common Crawl is a particular dataset: commoncrawl.org