by iamflimflam1 764 days ago
There isn’t really any other way for this to work. The only way for the model to answer questions on your pdf is for the information to be somewhere in the prompt.

That might be true of specific models or specific APIs for accessing them, but I’d argue it isn’t even remotely true of neural networks generally or generatively-pretrained decoder-only attention-inspired language models in particular.

Ideally, if you want a model’s weights to include a credible representation of non-trivial data, you want it somewhere in the training pipeline (usually earlier is better for important stuff, but that’s a heuristic at best). But there’s transfer learning of various kinds, and joint losses of countless kinds (CLIP in SD-style diffusers comes to mind), and fine-tunes (if that doesn’t just count as transfer learning), and dimensionality reduction that is often remarkably effective, and multi-tower models like what evolved into DLRM, and I’m forgetting/omitting easily 100x the approaches I mentioned.

It’s possible I misunderstand you, so please elaborate if so?

The way they vectorized the PDF could be less efficient than simply extracting the text and dropping it into context as text. If it's a 100 MB PDF then it's probably a scanned PDF, and OpenAI is probably using an OCR model to vectorize each page directly. It seems like an opaque process with room for inefficiency. So I would be interested to know whether we could save on token/vector fees by preprocessing the PDF to text with our own OCR.

No, it is not a scanned PDF but a standard textual PDF with tables, bullet points, chapters, etc. Somewhat like a manual.
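
For a textual PDF like that, the "extract the text yourself and drop it into context" idea from upthread might look roughly like this. It's only a sketch under assumptions: it uses the third-party pypdf package for extraction (no OCR, so it won't help with scanned pages), the function names and prompt wording are made up for illustration, and the 4-characters-per-token figure is a rule of thumb for English text, not what any provider actually bills.

```python
import math


def extract_pdf_text(path: str) -> str:
    """Pull the text layer out of a PDF, page by page.

    Assumes the third-party pypdf package is installed. The import is done
    lazily so the helpers below still work without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for pages with no text layer.
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def build_prompt(document_text: str, question: str, max_tokens: int = 8000) -> str:
    """Put as much of the document as fits the (estimated) budget into the prompt."""
    budget_chars = int(max_tokens * 4.0)  # same rough 4 chars/token assumption
    excerpt = document_text[:budget_chars]
    return (
        "Answer using only the document below.\n\n"
        "--- DOCUMENT ---\n"
        f"{excerpt}\n"
        "--- END ---\n\n"
        f"Question: {question}"
    )
```

Running estimate_tokens on the output of extract_pdf_text gives a ballpark for what the plain text would cost, which you could compare against what you're charged for the uploaded file. Tables tend to come out of plain text extraction flattened, so a manual-style PDF may need some cleanup before the answers are any good.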