by byteknight 673 days ago
This seems like cool work but with a ton of "marketing hype speak" that immediately gets watered down by the first paragraph.

Ordering of statements.

1. (Title) Classifying all of the pdfs on the internet

2. (First Paragraph) Well not all, but all the PDFs in Common Crawl

3. (First Image) Well not all of them, but 500k of them.

I am not knocking the project, but while categorizing 500k PDFs is something we couldn't necessarily do well a few years ago, this is far from "The internet's PDFs".

2 comments

Moreover, the classification was not done on the 500,000 PDF files themselves, but rather on the metadata of those 500,000 PDFs.
Overpromise with headline, underdeliver on details.