Hacker News
by hynky 171 days ago
HF recently published a blog post about their pre-training dataset built from PDFs.

Alongside that, they shared a graph of how many PDFs they were able to fetch by date, and the old internet seems to be mostly dead.