by barfbagginus
763 days ago
The way they vectorized the PDF could be less efficient than simply extracting the text and dropping it into the context as plain text. If it's a 100 MB PDF, it's probably a scanned document, and OpenAI is probably running an OCR model to vectorize each page directly. It seems like an opaque process with plenty of room for inefficiency. So I'd be interested to know whether we could save on token/vector fees by preprocessing the PDF to text with our own OCR.
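A rough way to sanity-check this idea is to compare the two costs on the back of an envelope. The sketch below is only that: the 4-characters-per-token ratio is a common rule of thumb rather than a real tokenizer, and the function names, the per-page fee, and the per-1k-token price are all hypothetical placeholders, not OpenAI's actual pricing. Plug in your own OCR output and the real rates from your bill.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed heuristic,
    not the output of a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def text_cost(pages: list[str], price_per_1k_tokens: float) -> float:
    """Cost of sending our own OCR'd text as plain context.
    price_per_1k_tokens is a hypothetical rate supplied by the caller."""
    total_tokens = sum(estimate_tokens(page) for page in pages)
    return total_tokens / 1000 * price_per_1k_tokens


def per_page_cost(pages: list[str], price_per_page: float) -> float:
    """Cost if the provider effectively bills per processed page.
    price_per_page is a hypothetical rate supplied by the caller."""
    return len(pages) * price_per_page


def estimated_savings(pages: list[str],
                      price_per_1k_tokens: float,
                      price_per_page: float) -> float:
    """Positive result means preprocessing to text looks cheaper."""
    return per_page_cost(pages, price_per_page) - text_cost(pages, price_per_1k_tokens)


if __name__ == "__main__":
    # Stand-in for text produced by your own OCR pass, one string per page.
    ocr_pages = ["lorem ipsum " * 250, "dolor sit amet " * 200]
    # Both prices below are made-up numbers for illustration only.
    print(estimated_savings(ocr_pages, price_per_1k_tokens=0.01, price_per_page=0.05))
```

This only tells you whether the text route is cheaper under your assumed rates. It says nothing about whether your OCR is as accurate as whatever the provider does internally.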