by angadsg 1153 days ago
IMO folks are better off deploying their own version where they can adjust a few knobs (e.g. split chunk size) to get better results, given that PDF Q&A is such a commodity application.

I wrote a <50-line version with LangChain that runs in your terminal against any folder full of PDF documents - https://github.com/angad/dharamshala/blob/main/docs.py

return_source_documents is particularly helpful to get a sense of what is being sent in the prompt.

3 comments

Consider adding a bit of overlap to the text chunks. Say, 300 characters (CharacterTextSplitter measures length in characters by default, not tokens):

  text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=300)
Otherwise, you'll likely end up with too many edge cases in which only part of a relevant context is retrieved :-)
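
To make the overlap idea concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation (CharacterTextSplitter splits on a separator and then merges pieces); the function name split_with_overlap is made up for illustration, and sizes are counted in characters.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 300) -> list[str]:
    """Cut text into fixed-size character chunks; neighbours share chunk_overlap characters.

    Hypothetical helper for illustration only, not LangChain's algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

With chunk_size=4 and chunk_overlap=2, every chunk starts with the last two characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.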
This is actually pretty insightful - I have done something similar with splitting my obsidian data into chunks using paragraphs and headers as demarcation, but this solves a more interesting problem of nuance! I like it.
If you're interested in improved chunking, I covered a few strategies that I used when building https://findsight.ai in my talk here (timestamp linked, <1 min): https://youtu.be/elNrRU12xRc?t=536
If you're already splitting documents by paragraph, consider using as much of the previous and next paragraphs as will fit as overlap.
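
A rough sketch of that paragraph-neighbour idea, assuming paragraphs are separated by blank lines and that "as much as fits" is a simple character budget. The name paragraphs_with_context and the context_chars parameter are hypothetical, not taken from any library.

```python
def paragraphs_with_context(text: str, context_chars: int = 200) -> list[str]:
    """Return one chunk per paragraph, padded with the tail of the previous
    paragraph and the head of the next one (up to context_chars each).

    Hypothetical helper; assumes paragraphs are separated by blank lines.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for i, para in enumerate(paragraphs):
        before = ""
        after = ""
        if context_chars > 0:  # guard: s[-0:] would return the whole string
            if i > 0:
                before = paragraphs[i - 1][-context_chars:]
            if i + 1 < len(paragraphs):
                after = paragraphs[i + 1][:context_chars]
        # Skip empty context parts so first/last paragraphs have no stray separators.
        chunks.append("\n\n".join(part for part in (before, para, after) if part))
    return chunks


if __name__ == "__main__":
    sample = "A one.\n\nB two.\n\nC three."
    for chunk in paragraphs_with_context(sample, context_chars=3):
        print(repr(chunk))
```

Cutting the neighbours at a character count can slice words in half; trimming to a sentence boundary would be a natural refinement.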
We did chunks with a sliding window of previous page + current page + next page, with overlaps. That produced the best results.
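
For anyone who wants to try the page-window variant, a minimal sketch, assuming you already have the text of each page as a list of strings. The function name page_windows and the radius parameter are made up here; how the original commenter handled overlaps within a window isn't stated, so this only shows the previous + current + next grouping.

```python
def page_windows(pages: list[str], radius: int = 1) -> list[str]:
    """Build one chunk per page: the page itself plus up to `radius` pages on each side.

    Hypothetical helper; consecutive windows overlap because they share pages.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    windows = []
    for i in range(len(pages)):
        lo = max(0, i - radius)               # clamp at the first page
        hi = min(len(pages), i + radius + 1)  # clamp at the last page
        windows.append("\n".join(pages[lo:hi]))
    return windows


if __name__ == "__main__":
    for window in page_windows(["page 1", "page 2", "page 3", "page 4"]):
        print(repr(window))
```

Each window is roughly three pages long, so this trades more tokens per retrieved chunk for fewer answers that get cut off at a page break.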
This would be much more useful if it used Vicuna, or if you could select a different model.
The link to your repo is returning a 404 now, whereas I could see it just a min ago.