How are you solving for PDFs that are too large to fit in the token context? I know of a few approaches for that:

- Ignore the problem and let it hallucinate answers to anything that's not in the first 5-10 pages.

- Attempt to recursively summarize the PDF at the start: summarize e.g. pages 1-3, then 4-6 etc., then if the resulting summaries are still too long for the context window, run a summary of those summaries. Use the summary in the context to help answer the user's questions.

- Implement a mechanism for finding the most likely subset of the PDF content to include in the prompt based on the user's question. You could use the LLM to extract likely search terms, then run a dumb search for those terms and include the surrounding text in the prompt. Or you could calculate embeddings on the different sections of the document and do a semantic search against them to find the most appropriate sections, as I did in https://simonwillison.net/2023/Jan/13/semantic-search-answer...

Which approach did you use? Am I missing any options here?
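For what it's worth, here is a rough sketch of the recursive summarization option. The names `recursive_summary` and `truncate_summarize` are made up for illustration, and `truncate_summarize` is only a placeholder for a real LLM call. The budget is counted in characters rather than tokens to keep the example self-contained.

```python
from typing import Callable, List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def recursive_summary(
    pages: List[str],
    summarize: Callable[[str], str],
    max_chars: int,
    group_size: int = 3,
) -> str:
    """Summarize groups of pages, then summaries of summaries, until it fits.

    `summarize` is assumed to wrap an LLM call; it is injected so the
    control flow can be tested without any API access.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks")
    texts = list(pages)
    # Each round merges `group_size` texts into one summary, so the list
    # gets shorter every time and the loop always terminates.
    while len(texts) > 1 and sum(len(t) for t in texts) > max_chars:
        texts = [summarize("\n".join(group)) for group in chunk(texts, group_size)]
    return "\n".join(texts)


def truncate_summarize(text: str) -> str:
    """Placeholder summarizer (NOT a real summary): keeps the first 40 chars."""
    return text[:40]


if __name__ == "__main__":
    demo_pages = [f"page {i} " + "x" * 100 for i in range(9)]
    print(recursive_summary(demo_pages, truncate_summarize, max_chars=200))
```

Note that if a single remaining summary is still over the budget, this sketch just returns it as-is; a real version would need to decide what to do in that case.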
> In the analyzing step, ChatPDF creates a semantic index over all paragraphs of the PDF. When answering a question, ChatPDF finds the most relevant paragraphs from the PDF and uses the ChatGPT API from OpenAI to generate an answer.
Are you using OpenAI's embeddings to implement that?
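To make the question concrete, this is roughly the shape I have in mind for a paragraph-level index. It is only a guess at the general technique, not ChatPDF's actual implementation. The `embed` function here is a toy bag-of-words stand-in so the example runs offline; a real version would presumably call an embedding model at that point, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

Vector = Counter  # sparse word-count vector, standing in for a real embedding


def embed(text: str) -> Vector:
    """Toy embedding: lowercase word counts. Assumption: swap in a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(paragraphs: List[str]) -> List[Tuple[str, Vector]]:
    """Embed every paragraph once, up front (the 'analyzing step')."""
    return [(p, embed(p)) for p in paragraphs]


def top_paragraphs(index: List[Tuple[str, Vector]], question: str, k: int = 2) -> List[str]:
    """Return the k paragraphs most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[1]), reverse=True)
    return [paragraph for paragraph, _ in ranked[:k]]


if __name__ == "__main__":
    index = build_index([
        "The warranty lasts two years from purchase.",
        "Shipping takes five business days.",
        "Returns are accepted within thirty days.",
    ])
    print(top_paragraphs(index, "How long does the warranty last?", k=1))
```

The selected paragraphs would then be pasted into the prompt ahead of the user's question. Word-count vectors only match on shared words, which is exactly the gap a learned embedding is supposed to close.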