by dubbid 997 days ago
The idea is to handle scanned documents as well. Besides, text boxes can sometimes get so distorted by whitespace that they look very different to a computer than they do in real life.

You are right that this would be more efficient in many cases (not scanned, no weird whitespace). In practice, though, the cost of OCR is so low compared to the LLM costs, and the relative consistency of OCR outputs helps so much, that I don't try to handle the PDF object extraction.

1 comment

Fair point :) And yes, some PDFs use weird ways to represent the spacing between words.