Hacker News
by simonw 1157 days ago
How are you solving for PDFs that are too large to fit in the token context?

I know of a few approaches for that:

- Ignore the problem and let it hallucinate answers to anything that's not in the first 5-10 pages

- Attempt to recursively summarize the PDF at the start - so summarize e.g. pages 1-3, then 4-6 etc, then if the resulting summaries are still too long for the context window run a summary of those summaries. Use the summary in the context to help answer the user's questions.

- Implement a mechanism for finding the most likely subset of the PDF content to include in the prompt based on the user's question. You could use the LLM to extract likely search terms, then run a dumb search for those terms and include the surrounding text in the prompt - or you could calculate embeddings on the different sections of the document and do a semantic search against it to find the most appropriate sections, as I did in https://simonwillison.net/2023/Jan/13/semantic-search-answer...
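
For what it's worth, the recursive summarization option above could be sketched roughly like this. This is a minimal sketch, not anyone's actual implementation: `summarize` is a hypothetical callable standing in for an LLM call, and character count is used as a crude stand-in for a token count.

```python
from typing import Callable, List


def recursive_summary(
    pages: List[str],
    summarize: Callable[[str], str],  # hypothetical: wraps whatever LLM call you use
    max_chars: int,                   # assumption: characters as a rough proxy for tokens
    group_size: int = 3,              # e.g. pages 1-3, then 4-6, and so on
) -> str:
    """Summarize pages in groups, then summarize the summaries until they fit."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks the list")
    texts = list(pages)
    # Keep collapsing groups of texts until the joined result fits the budget
    # (or until a single text is left, which cannot be grouped any further).
    while len(texts) > 1 and len("\n\n".join(texts)) > max_chars:
        texts = [
            summarize("\n\n".join(texts[i:i + group_size]))
            for i in range(0, len(texts), group_size)
        ]
    return "\n\n".join(texts)
```

The returned text would then go into the prompt alongside the user's question. Note that a document that already fits is returned unsummarized.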

Which approach did you use? Am I missing any options here?

4 comments

The FAQ answers my question:

> In the analyzing step, ChatPDF creates a semantic index over all paragraphs of the PDF. When answering a question, ChatPDF finds the most relevant paragraphs from the PDF and uses the ChatGPT API from OpenAI to generate an answer.

Are you using OpenAI's embeddings to implement that?

I don't know if this would work well for a lot of the technical documentation I work with. It's written in a format similar to a software program, where you constantly have to flip back and forth between many pages to decode what is being said.

For a simple example, take a car manual where you want to change the brakes. The brake section probably won't tell you how to remove the wheels; you have to look at the wheel section. The wheel section won't tell you about the nuts; you have to look in the spec sheets. And the spec sheet won't have the torque; you have to look in the chapter reference.

Oftentimes the manuals are not nice enough to point you to the relevant sections, so you just have to stumble around the manual for a long time.

Yes, I wonder if there needs to be a level of recursion to solve for this problem:

1. User enters question
2. Semantic search for relevant sections of input material
3. Ask the LLM whether it needs any further context to answer the question
4. If it does, GOTO 2
5. Finish
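
A rough sketch of that loop, under some assumptions: `search` and `ask_llm` are hypothetical callables (not from any real library), `ask_llm` is assumed to return either an answer or a follow-up search query, and the `max_rounds` cap is my addition so the GOTO cannot loop forever.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures, for illustration only.
Search = Callable[[str], List[str]]  # query -> relevant sections
# (question, context) -> (answer, None) or (None, follow-up query)
AskLLM = Callable[[str, List[str]], Tuple[Optional[str], Optional[str]]]


def answer_with_followups(
    question: str,
    search: Search,
    ask_llm: AskLLM,
    max_rounds: int = 4,
) -> Optional[str]:
    """Retrieve, ask, and retrieve again while the model asks for more context."""
    context: List[str] = []
    query = question
    for _ in range(max_rounds):
        # Step 2: semantic search, keeping sections found in earlier rounds.
        for section in search(query):
            if section not in context:
                context.append(section)
        # Step 3: the model either answers or names what it still needs.
        answer, follow_up = ask_llm(question, context)
        if follow_up is None:
            return answer  # Step 5: finish
        query = follow_up  # Step 4: GOTO 2 with the new query
    return None  # gave up after max_rounds without a final answer
```

In the car manual case, the first round would fetch the brake section, the model would ask for wheel removal, and the second round would add the wheel section to the context.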

Yes, we're using OpenAI embeddings

- Mathis from ChatPDF

I can answer for my site (https://docalysis.com/) which does a semantic search to figure out which parts of the document are most relevant. Then you just use those parts.

Docalysis also shows you the PDF side-by-side, has page numbers, and overall responses are of better quality according to users that have emailed comparisons to ChatPDF.

Tested and love it. Thank you for creating it. Another great feature would be to allow multiple PDFs to talk to each other. Can you help create that?
Consider adding the ability to try your service before signup.
> Spotted this idea from Hassan Hayat: “don’t embed the question when searching. Ask GPT-3 to generate a fake answer, embed this answer, and use this to search”. See also this paper about Hypothetical Document Embeddings, via Jay Hack.

That is incredibly interesting. We really need an Internet-scale semantic search engine API to try out this and make interesting LLM-based tools. Hooking up LLMs to classic keyword search engines like Bing and Google often gives underwhelming results.
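
The quoted trick is small enough to sketch in a few lines. All three callables here are hypothetical placeholders: `generate_fake_answer` stands in for the LLM call, `embed` for whatever embedding model you use, and `nearest_chunks` for your vector search.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def hyde_search(
    question: str,
    generate_fake_answer: Callable[[str], str],     # hypothetical LLM call
    embed: Callable[[str], Vector],                 # hypothetical embedding call
    nearest_chunks: Callable[[Vector], List[str]],  # hypothetical vector search
) -> List[str]:
    """Search with the embedding of a made-up answer instead of the question."""
    # The fake answer may be factually wrong; the hope is only that it is
    # phrased like the passages that contain the real answer.
    fake_answer = generate_fake_answer(question)
    return nearest_chunks(embed(fake_answer))
```

The only change from plain semantic search is which text gets embedded; the index and the similarity search stay the same.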

Chunk the PDF text and create embeddings. Get the cosine similarity between the user query and each chunk, and send the top N chunks that fit within the token limit to OpenAI.
This is the way.
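
A minimal sketch of the chunk, embed, and rank recipe above, with assumptions flagged: `embed` and `count_tokens` are hypothetical callables you would back with a real embedding model and tokenizer, the chunk vectors are assumed to be computed once up front, and fixed-size character chunks with overlap are just one possible chunking choice.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(
    query: str,
    chunks: List[str],
    chunk_vectors: List[Vector],        # embed(chunk) for each chunk, precomputed
    embed: Callable[[str], Vector],     # hypothetical embedding call
    count_tokens: Callable[[str], int], # hypothetical tokenizer-based counter
    top_n: int,
    token_budget: int,
) -> List[str]:
    """Return up to top_n most similar chunks that fit in the token budget."""
    query_vector = embed(query)
    ranked = sorted(
        zip(chunks, chunk_vectors),
        key=lambda pair: cosine(query_vector, pair[1]),
        reverse=True,
    )
    picked: List[str] = []
    used = 0
    for chunk, _ in ranked[:top_n]:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break  # stop at the first chunk that would overflow the budget
        picked.append(chunk)
        used += cost
    return picked
```

The picked chunks would then be pasted into the prompt ahead of the user's question.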