| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by pks016 548 days ago
	Question for people who spent more time with these small models. What's a current best small model to extract information from a large number of pdfs? I have multiple collection of research articles. I want two tasks 1) Extract info from pdfs 2) classify papers based the content of the paper. Or point me to right direction

1 comments

themanmaran 548 days ago

Hey this is something we know a lot about. I'd say Qwen 2.5 32B would be the best here.

We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction. With Qwen 72B at around 70%. Smaller models will go down from there.

But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title or hard like array_of_all_citations.

link

pks016 548 days ago

Most of them are experimental studies. So it would be text extraction of something like title, authors, species of the study, sample size etc. And classify based on the content of the pdfs.

I tried the GPT-4o, it's good but it'll cost a lot if I want to process all the documents.

link

SparkyMcUnicorn 547 days ago

1. You can get a 50% discount via batching.

2. Give a few Sonnet or 4o input/output examples to haiku, 4o-mini, or any other smaller model. Giving good examples to smaller models can bring the output quality closer to (or on par with) the better model.

link