| HN Mirror


by ahmedhawas123 309 days ago

This may be a somewhat irrelevant, at best imaginative, rant, but there is no shortage of PDF parsing solutions out there, ranging from mediocre to near perfect for specific use cases. This is a great addition to that list.

That said, over the last two years I've come across many use cases for parsing PDFs, and each has its own requirements (e.g., figuring out titles, removing page numbers, extracting specific sections, etc.). And each requires a different approach.

My point is, this is awesome, but I wonder if there needs to be a broader push / initiative to stop relying on PDFs so much when HTML, XML, JSON, and a million other formats exist. It's a hard undertaking, no doubt, but it's not unheard of to drop a technology (e.g., fax) for a better one.

2 comments

bm-rf 309 days ago

When an LLM "reads" a PDF, the PDF is just rendered as an image, so the file format does not matter. And for documents that already exist, a robust OCR solution that can handle tables and diagrams could be very valuable.


mdaniel 309 days ago

That ship has sailed, and I'd guess the majority of the folks in these threads are in the same boat I am: you don't get to choose what files your customers send you; you have to meet them where they are.
