|
|
|
|
|
by pks016
502 days ago
|
|
Question for people who spent more time with these small models. What's a current best small model to extract information from a large number of pdfs? I have multiple collection of research articles. I want two tasks 1) Extract info from pdfs 2) classify papers based the content of the paper. Or point me to right direction |
|
We've found GPT-4o/Claude 3.5 to benchmark at around 85% accuracy on document extraction. With Qwen 72B at around 70%. Smaller models will go down from there.
But it really depends on the complexity of the documents, and how much information you're looking to pull out. Is it something easy like document_title or hard like array_of_all_citations.