HN Mirror


by svieira 727 days ago

> You tell it to write a program (which is easy to audit) to pull the info from the PDFs

Wherein you discover that, unless you ask it to consider the fact that PDFs are ... very hard to parse [1] [2], you get something that misses whole blocks of text or turns them into something they aren't, and so the rest of the program misses chunks of the document.

[1]: https://news.ycombinator.com/item?id=22473263
[2]: https://web.archive.org/web/20200303102734/https://www.filin...


bongodongobob 726 days ago

Why are you expecting them all to be very different? They're all likely very similar.


svieira 726 days ago

Because presuming that all of them are produced by the same utility is a _presumption_. They could be - but they could also be produced by many different vendors using many different methods, all of them simply conforming to the specification "a PDF with HIGH LEVEL DESCRIPTION OF THE DATA".
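One cheap way to test that presumption before writing any extraction code is to group the files by the tool that claims to have produced them. What follows is only a rough sketch under loud assumptions: it scans the raw bytes for an uncompressed `/Producer (...)` entry written as a plain literal string, and it files everything else (compressed object streams, hex or UTF-16 strings, missing metadata) under "unknown". The names `guess_producer` and `group_by_producer` are made up for illustration, and a real PDF library would be the more reliable route.

```python
import re
from collections import defaultdict

# Matches an uncompressed Info entry such as: /Producer (SomeTool 1.2)
# Assumption: a plain literal string with no nested or escaped parentheses.
PRODUCER_RE = re.compile(rb"/Producer\s*\(([^()\\]*)\)")


def guess_producer(raw: bytes) -> str:
    """Return the producer string found in the raw bytes, or "unknown"."""
    match = PRODUCER_RE.search(raw)
    if match is None:
        return "unknown"
    return match.group(1).decode("latin-1").strip()


def group_by_producer(docs: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each producer string to the document names that carry it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, raw in docs.items():
        groups[guess_producer(raw)].append(name)
    return dict(groups)
```

If this yields more than one group, or a large "unknown" bucket, that is a hint that the batch is not uniform and that a parser tuned on one sample may silently drop text on the others. A single group is weaker evidence: the same producer can still emit very different layouts.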
