| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by themanmaran 546 days ago

Hey this is something we know a lot about. I'd say Qwen 2.5 32B would be the best here.

We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction. With Qwen 72B at around 70%. Smaller models will go down from there.

But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title or hard like array_of_all_citations.

1 comments

pks016 546 days ago

Most of them are experimental studies. So it would be text extraction of something like title, authors, species of the study, sample size etc. And classify based on the content of the pdfs.

I tried the GPT-4o, it's good but it'll cost a lot if I want to process all the documents.

link

SparkyMcUnicorn 546 days ago

1. You can get a 50% discount via batching.

2. Give a few Sonnet or 4o input/output examples to haiku, 4o-mini, or any other smaller model. Giving good examples to smaller models can bring the output quality closer to (or on par with) the better model.

link