Hacker News
by deweller 673 days ago
Is it possible that the 8 TB is just the extracted text?

No, the SafeDocs dataset consists of unprocessed PDFs.