
by Teleoflexuous 686 days ago

My use case is research papers. That means very clear text, combined with graphs of varying form and quality, plus the occasional formula.

The approaches I had the most, though not complete, success with are: 1) converting pages to images with pdf2image, then reading them with pytesseract; 2) throwing whole PDFs into pypdf; 3) experimental multimodal models.

You get better results the more predictable you make the content: if you know a part is going to be pure text, just put it through pypdf; if you know it is going to be a math formula, explain the field to the model and have it read the formula back as if for an audience with high accessibility needs. Even so, it continues to be a nightmare and a bottleneck.
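
A minimal sketch of that routing idea, assuming pypdf, pdf2image, and pytesseract are installed (pdf2image also needs Poppler, and pytesseract needs the Tesseract binary). The `needs_ocr` heuristic and its `min_chars` threshold are hypothetical choices for illustration, not something from the comment above: pages with a usable embedded text layer go through pypdf, everything else falls back to OCR.

```python
def needs_ocr(extracted_text: str, min_chars: int = 200) -> bool:
    """Guess whether a page lacks a usable text layer.

    Assumption: a page yielding very little embedded text is probably
    a scan or a full-page figure. The threshold is arbitrary.
    """
    return len(extracted_text.strip()) < min_chars


def extract_pages(pdf_path: str, dpi: int = 300) -> list[str]:
    """Return one string per page, using pypdf first and OCR as a fallback."""
    # Third-party imports are local so needs_ocr() can be used without them.
    from pypdf import PdfReader
    from pdf2image import convert_from_path
    import pytesseract

    reader = PdfReader(pdf_path)
    pages = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if needs_ocr(text):
            # Render only this page; pdf2image page numbers are 1-based.
            images = convert_from_path(
                pdf_path, dpi=dpi, first_page=number, last_page=number
            )
            text = pytesseract.image_to_string(images[0])
        pages.append(text)
    return pages
```

Calling `extract_pages("paper.pdf")` (a placeholder path) would return the per-page text. Formulas and graphs would still need separate handling, since neither path reads them reliably.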

3 comments

freethejazz 675 days ago

Depending on how much structure you want to extract before passing the PDF contents to the next step in your pipeline, this paper[1] might be helpful in surfacing more options. It's a review/benchmark of numerous tools applied to information extraction from academic documents. I haven't evaluated the solutions they examined myself, but it's how I discovered GROBID, and IMO it lays out the strengths of each approach clearly.

[1] https://arxiv.org/pdf/2303.09957

authorfly 686 days ago

I have great news that I wish someone had delivered to me when I was in your shoes: try "GROBID". It parses papers into objects with abstract/body/figures! It will help you out a great deal. It is designed for papers and can extract the text almost flawlessly, and it also gives information on graphs for separate processing. I have several years of experience with academic text processing (including presentations) working with an academic publisher, if I could be of help with anything?
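
As a rough illustration of what "objects with abstract/body/figures" can look like in practice: GROBID returns TEI XML (for example from the `/api/processFulltextDocument` endpoint of a running GROBID server). The sketch below only parses such a response with the standard library. The helper name `summarize_tei` and the exact element paths are assumptions about typical GROBID output, so check them against a real response.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's XML output.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(element) -> str:
    """Flatten an element's text content and normalize whitespace."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def summarize_tei(tei_xml: str) -> dict:
    """Pull title, abstract, body paragraphs, and figure descriptions from TEI."""
    root = ET.fromstring(tei_xml)
    return {
        "title": _text(root.find(".//tei:titleStmt/tei:title", TEI)),
        "abstract": _text(root.find(".//tei:profileDesc/tei:abstract", TEI)),
        "body": [_text(p) for p in root.findall(".//tei:body//tei:p", TEI)],
        # Figure descriptions can be routed to a separate graph-processing step.
        "figures": [_text(f) for f in root.findall(".//tei:figure/tei:figDesc", TEI)],
    }
```

Keeping the figure descriptions in their own list makes it easy to hand graphs to a different tool than the one that handles the running text.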

Teleoflexuous 686 days ago

I have no idea how I missed them last time I was looking around, unless they grew significantly over the last half a year or so. I'll check it out when I get back to this project, thanks.

I wish I were hiring, if that's what you're asking ;) Otherwise, if you have any ideas for processing formulas, I'd love to hear them (even just for reading them out, but any extra steps towards expressing what they mean would help; "'sum divided by count' is the 'mean'/'average' value" is the simplest example I can think of). Novel ideas in technical papers are often expressed with formulas that aren't that complicated conceptually but are critical to understanding the whole paper, and that was another piece I was getting very mixed results with.

authorfly 685 days ago

No worries. Sure, as to formulas... I suspect many of them are LaTeX. If it is possible to parse that, it could help? At sufficient picture quality, vision models can accurately parse images of formulas.

Neither will probably help you with a "readable" formula system, because in my experience the readers that do this for LaTeX or plain formula text have flaws anyway (it's also somewhat cultural and dependent on the field of study). Maybe the best bet is prompting a vision model with "read this formula out loud in a digestible, understandable, concise way", though this may have issues with recall accuracy.
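
To make the limits of rule-based readers concrete, here is a toy sketch that reads a very small subset of LaTeX aloud. Everything in it (the `verbalize` name, the tiny vocabulary, the phrasing) is a hypothetical illustration, not an existing tool: it handles `\frac`, a few commands, and simple subscripts and superscripts, and breaks down on most real formulas.

```python
import re

# Matches \frac{num}{den} where neither argument contains braces,
# i.e. the innermost fractions.
_FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")

# Hand-picked vocabulary; a real reader would need far more, and the
# preferred wording varies by field.
_SPOKEN = {
    r"\sum": "the sum of",
    r"\cdot": "times",
    r"\times": "times",
}


def verbalize(latex: str) -> str:
    """Turn a tiny subset of LaTeX into a spoken-style string."""
    text = latex
    previous = None
    # Rewrite innermost fractions repeatedly so nested ones resolve inside-out.
    while previous != text:
        previous = text
        text = _FRAC.sub(r"\1 divided by \2", text)
    for command, words in _SPOKEN.items():
        text = text.replace(command, f" {words} ")
    # Simple subscripts and superscripts, with or without braces.
    text = re.sub(r"_\{?(\w+)\}?", r" sub \1", text)
    text = re.sub(r"\^\{?(\w+)\}?", r" to the power \1", text)
    return " ".join(text.split())


print(verbalize(r"\frac{\sum x_i}{n}"))
```

For the mean, this prints "the sum of x sub i divided by n", which reads the symbols but never says "average"; that semantic step is where a model prompt like the one above would have to come in.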

siamese_puff 685 days ago

Check out appjsonify for research papers
