
by constantinum 749 days ago

The primary challenge is not just harnessing AI for search; it's preparing complex documents for LLM consumption: varied formats, structures, and designs, scans, multi-layout tables, and even poorly captured images. This is a crucial issue.

There is a 20 min read on why parsing PDFs is hell: https://unstract.com/blog/pdf-hell-and-practical-rag-applica...

To parse PDFs for RAG applications, you'll need tools like LLMWhisperer[1] or unstructured.io[2].

Now back to your problem:

This solution might be overkill for your requirements, but you can try the following:

To get started quickly, try Unstract[3], an open-source document processing tool. You can set it up yourself and bring your own LLMs; it also supports local models. It has a GUI for writing prompts to get insights from your documents.[4]

[1] https://unstract.com/llmwhisperer/
[2] https://unstructured.io/
[3] https://github.com/Zipstack/unstract
[4] https://github.com/Zipstack/unstract/blob/main/docs/assets/p...

2 comments

jszymborski 749 days ago

Apache Tika could help extract the relevant bits of PDFs, couldn't it?

https://tika.apache.org/
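
For a rough idea of what that could look like, here is a minimal sketch that sends a PDF to a Tika server over its REST endpoint and asks for plain text back. It assumes a Tika server is already running locally on its default port (9998); the function names and the file name in the usage note are made up for illustration, and scanned PDFs would still need OCR on top of this.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika to return the document body as plain text."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",             # ask for plain text instead of XHTML
            "Content-Type": "application/pdf",  # hint the input type to Tika
        },
    )


def extract_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Read a PDF from disk, send it to the Tika server, and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("report.pdf")` (hypothetical file) would return whatever text Tika can pull out, which you could then chunk and feed to the LLM.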


fooker 749 days ago

Modern LLMs are good enough at treating PDFs as images and grokking the context.

Well, Claude and GPT-4 seem to be.
