| HN Mirror


by mvac 469 days ago

Great progress, but unfortunately, for our use case (converting medical textbooks from PDF to MD), the results are not as good as those from MinerU/PDF-Extract-Kit [1].

Also, the Colab link in the article is broken; I found a working one [2] in the docs.

[1] https://github.com/opendatalab/MinerU
[2] https://colab.research.google.com/github/mistralai/cookbook/...

2 comments

owenpalmer 469 days ago

I've been searching relentlessly for something like this! I wonder why it's been so hard to find... is it the Chinese?

In any case, thanks for sharing.


thelittleone 469 days ago

Have you had a chance to compare results from MinerU with those from an LLM such as Gemini 2.0 or Anthropic's native PDF tool?


mvac 468 days ago

Yes, I have. The problem with using just an LLM is that while it reads and understands text, it cannot reproduce it accurately. Additionally, the textbooks I mentioned have many diagrams and illustrations (e.g. books on anatomy or biochemistry). I don't really care about extracting text from those; I just need them extracted as images alongside the text, and no LLM does that.
