| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by throwaway4496 324 days ago
	So you parse PDFs, but also OCR images, to somehow get better results? Do you know you could just use the parsing engine that renders the PDF to get the output? I mean, why raster it, OCR it, and then use AI? Sounds creating a problem to use AI to solve it.

3 comments

daemonologist 324 days ago

Yes, but a lot of the improvement is coming from layout models and/or multimodal LLMs operating directly on the raster images, as opposed to via classical OCR. This gets better results because the PDF format does not necessarily impart reading order or semantic meaning; the only way to be confident you're reading it like a human would is to actually do so - to render it out.

Another thing is that most document parsing tasks are going to run into a significant volume of PDFs which are actually just a bunch of scans/images of paper, so you need to build this capability anyways.

TL;DR: PDFs are basically steganography

link

throwaway4496 324 days ago

Hard no.

LLMs aren't going to magically do more than what your PDF rendering engine does, rastering it and OCR'ing doesn't change anything. I am amazed at how many people actually think it is a sane idea.

link

protomikron 324 days ago

I think there is some kind of misunderstanding. Sure, if you get somehow structured, machine-generated PDFs parsing them might be feasible.

But what about the "scanned" document part? How do you handle that? Your PDF rendering engine probably just says: image at pos x,y with size height,width.

So as parent says you have to OCR/AI that photo anyway and it seems that's also a feasible approach for "real" pdfs.

link

throwaway4496 324 days ago

Okay, this sounds like "because some part of the road is rough, why don't we just drive in the ditch along the road way all the way, we could drive a tank, that would solve it"?

link

Macha 324 days ago

My experience is that “text is actually images or paths” is closer to the 40% case than the 1% case.

So you could build an approach that works for the 60% case, is more complex to build, and produces inferior results, but then you still need to also build the ocr pipeline for the other 40%. And if you’re building the ocr pipeline anyway and it produces better results, why would you not use it 100%?

link

yxhuvud 324 days ago

Well, you clearly hasn't parsed a wide variety of pdfs. Because if you had, you had been exposed to pdfs that contain only images, or those that contain embedded text, but that embedded text is utter nonsense and doesn't match what is shown on the page when rendered.

And that is before we even get into text structure, because as everyone knows, reading text is easier if things like paragraphs, columns and tables are preserved in the output. And guess what, if you just use the parsing engine for that, then what you get out is a garbled mess.

link

throwaway4496 322 days ago

If your rendering engine doesn't output what is shown, your engine is broken, and it can be broken whatever you render it into bitmap or structured data.

link

diptanu 324 days ago

We parse PDFs to convert them to text in a linearized fashion. The use case for this would be to use the content for downstream use cases - search engine, structured extraction, etc.

link

throwaway4496 324 days ago

None of that changes the fact that to get a raster, you have to solve the PDF parsing/rendering problem anyways, so might as well get structured data out instead of pixels so that it now another problem (OCR).

link