by Teleoflexuous
686 days ago
My use case is research papers. That means very clear text, combined with graphs of varying form and quality, and finally the occasional formula. The approaches I have had the most, but not full, success with are:
1) converting to image with pdf2image, then reading with pytesseract
2) throwing whole pdfs into pypdf
3) experimental multimodal models

The more predictable you can make the content, the better it will go (if you know a part is going to be pure text, just put it in pypdf; if you know it is going to be a math formula, explain the field to the model and have it read the formula back as if for an audience with high accessibility needs), but it continues to be a nightmare and a bottleneck.
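For anyone who wants to try the first two approaches, here is a minimal sketch, not a definitive pipeline. The function names, the `dpi` default, and the `min_chars` routing threshold are all my own assumptions for illustration; only the library calls (`convert_from_path`, `image_to_string`, `PdfReader`, `extract_text`) come from pdf2image, pytesseract and pypdf themselves.

```python
def ocr_pdf(pdf_path, dpi=300):
    """Approach 1: rasterize each page with pdf2image, then OCR it with pytesseract."""
    # Imported lazily so the rest of the sketch works without these installed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(str(pdf_path), dpi=dpi)  # one PIL image per page
    return [pytesseract.image_to_string(img) for img in images]


def text_layer_pdf(pdf_path):
    """Approach 2: read the embedded text layer of each page with pypdf."""
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can come back empty for scanned or image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def pick_extractor(page_texts, min_chars=200):
    """Hypothetical routing rule (an assumption, not a tested heuristic):
    keep pypdf for pages with a usable text layer, fall back to OCR otherwise."""
    return [
        "pypdf" if len(text.strip()) >= min_chars else "ocr"
        for text in page_texts
    ]


if __name__ == "__main__":
    import sys

    path = sys.argv[1]
    layer = text_layer_pdf(path)
    plan = pick_extractor(layer)
    # Only run the (slow) OCR pass if at least one page needs it.
    ocr = ocr_pdf(path) if "ocr" in plan else []
    pages = [ocr[i] if how == "ocr" else layer[i] for i, how in enumerate(plan)]
    print("\n\n".join(pages))
```

Note that pdf2image needs Poppler installed and pytesseract needs the Tesseract binary, so neither is a pure pip install. Formulas and graphs will still come out mangled with either route, which is where the multimodal option comes in.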
[1] https://arxiv.org/pdf/2303.09957