Hacker News
by barfbagginus 763 days ago
The way they vectorized the PDF could be less efficient than simply extracting the text and dropping it into the context as plain text. If it's a 100 MB PDF, it's probably a scanned PDF, and OpenAI is probably running an OCR model to vectorize each page directly. It seems like an opaque process with room for inefficiency. So I'd be interested to know whether we could save on token/vector fees by preprocessing the PDF to text with our own OCR.
1 comment

No, it is not a scanned PDF but a standard textual PDF with tables, bullet points, chapters, etc. Somewhat like a manual.
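
Since it's a textual PDF, the text can be pulled out without OCR, and you can get a ballpark for the "plain text in context" route before paying for anything. A minimal sketch, assuming the text has already been extracted to a string by whatever tool you prefer. The ~4 characters per token figure is only a rough rule of thumb for English text, and `estimate_tokens`, `estimate_cost`, and the price argument are hypothetical names for illustration, not anything from OpenAI's API:

```python
import math

# Assumption: ~4 characters per token is a rough rule of thumb for English
# text. A real tokenizer would give exact counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate for plain text already extracted from a PDF."""
    # Collapse whitespace runs so layout padding left over from extraction
    # (table spacing, blank lines) is not counted.
    normalized = " ".join(text.split())
    return math.ceil(len(normalized) / CHARS_PER_TOKEN)


def estimate_cost(text: str, price_per_million_tokens: float) -> float:
    """Hypothetical cost helper; the caller supplies the current price."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000
```

Comparing that estimate with what the file upload is actually billed at would show whether preprocessing is worth it. Tables in the manual may extract messily, so the estimate is only as good as the extracted text.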