| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by reerdna 692 days ago

For use in retrieval/RAG, an emerging paradigm is to not parse the PDF at all.

By using a multi-modal foundation model, you convert visual representations ("screenshots") of the pdf directly into searchable vector representations.

Paper: Efficient Document Retrieval with Vision Language Models - https://arxiv.org/abs/2407.01449

Vespa.ai blog post https://blog.vespa.ai/retrieval-with-vision-language-models-... (my day job)

5 comments

attilakun 692 days ago

I do something similar in my file-renamer app (sort.photos if you want to check it out):

1. Render first 2 pages of PDF into a JPEG offline in the Mac app.

2. Upload JPEG to ChatGPT Vision and ask what would be a good file name for this.

It works surprisingly well.

link

qeternity 692 days ago

I'm sure this will change over time, but I have yet to see an LMM that performs (on average) as well as decent text extraction pipelines.

Text embeddings for text also have much better recall in my tests.

link

infecto 692 days ago

No multi-modal model is ready for that in reality. The accuracy from other tools to extract tables and text are far superior.

link

authorfly 692 days ago

You have detractors, but this is the future.

link

cpursley 692 days ago

Is anyone actually having success with this approach? If so, how and with what models (and prompts)?

link

distracted_boy 692 days ago

Claude.ai handles tables very well, at least in my tests. It could easily convert a table from a financial document into a markdown table, among other things.

link