| HN Mirror



	by nl 791 days ago
	Does the llamaindex PDF indexer correctly deal with multi-column PDFs? Most I've seen don't, and you get very odd results because of this.
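	For anyone unsure what "very odd results" means here, this is a rough sketch of the failure mode, not how llamaindex or any particular library actually works. It assumes text blocks arrive as `(x0, y0, x1, y1, text)` tuples with y growing downward (loosely modelled on the first five fields PyMuPDF returns from `page.get_text("blocks")`), and it assumes a plain two-column page split at the midline. The function names are hypothetical.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text); y grows downward. This shape is an assumption
# made for illustration, not a guaranteed format from any one library.
Block = Tuple[float, float, float, float, str]


def naive_order(blocks: List[Block]) -> List[str]:
    """Sort top-to-bottom, then left-to-right: interleaves the columns."""
    return [b[4] for b in sorted(blocks, key=lambda b: (b[1], b[0]))]


def column_order(blocks: List[Block], page_width: float) -> List[str]:
    """Read the whole left column first, then the right column.

    Assumes exactly two columns separated at the page midline; real
    layouts (full-width headers, figures, three columns) need more care.
    """
    mid = page_width / 2
    left = [b for b in blocks if (b[0] + b[2]) / 2 < mid]
    right = [b for b in blocks if (b[0] + b[2]) / 2 >= mid]
    # Within each column, read top to bottom.
    left.sort(key=lambda b: b[1])
    right.sort(key=lambda b: b[1])
    return [b[4] for b in left + right]


blocks = [
    (50, 100, 280, 200, "left para 1"),
    (320, 100, 550, 200, "right para 1"),
    (50, 220, 280, 300, "left para 2"),
    (320, 220, 550, 300, "right para 2"),
]
print(naive_order(blocks))
print(column_order(blocks, page_width=600))
```

	The naive ordering jumps across the gutter after every paragraph, which is the kind of scrambled text that ends up in the index.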

2 comments

rspoerri 791 days ago

I've made quite good conversions from PDF to Markdown with https://github.com/VikParuchuri/marker. It's slow but worth a shot. Markdown should be easily parseable by a RAG system.

I'm trying to get a similar system set up on my computer.
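
As a rough idea of what "easily parseable" could look like, here is a minimal sketch that splits Markdown into one chunk per heading. It is not part of marker or of any RAG library; the function name `split_markdown_by_heading` and the chunk shape are made up for illustration, and it ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List

# ATX-style headings only ("# Title" through "###### Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split a Markdown string into {"title", "text"} chunks, one per heading.

    Hypothetical helper: text before the first heading gets an empty title.
    """
    chunks: List[Dict[str, str]] = []
    title = ""
    body: List[str] = []

    def flush() -> None:
        # Skip the empty chunk that would otherwise precede the first heading.
        if title or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Intro\nHello\n\n## Methods\nTwo columns\nmore\n"
for chunk in split_markdown_by_heading(doc):
    print(chunk["title"], "->", repr(chunk["text"]))
```

Each chunk can then be embedded on its own, with the heading kept as metadata.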


nl 791 days ago

This looks worth exploring, so thanks. The author has done a bunch of work beyond what PyMuPDF does on multicolumn layouts.


pierre 791 days ago

Locally you can choose pypdf or mupdf, which are good but not perfect. If you can send your data online, llamaparse is quite good.


j45 791 days ago

Pulling the text out of the PDFs correctly, as a separate step, is the right approach.
