
by angadsg 1153 days ago

IMO folks are better off deploying their own version where they can adjust a few knobs (e.g. split chunk size) to get better results, given that PDF Q&A is such a commodity application.

I wrote a version in under 50 lines with LangChain that runs in your terminal against any folder full of PDF documents - https://github.com/angad/dharamshala/blob/main/docs.py

return_source_documents is particularly helpful to get a sense of what is being sent in the prompt.

3 comments

cs702 1153 days ago

Consider adding a bit of overlap to the text chunks. Say, 300 characters (CharacterTextSplitter measures length in characters by default, not tokens):

  text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=300)

Otherwise, you'll likely end up with too many edge cases in which only part of a relevant context is retrieved :-)
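
To make the effect of overlap concrete, here is a minimal sketch in plain Python rather than LangChain's own splitter. The helper name `chunk_with_overlap` is hypothetical, and it counts characters, not tokens. Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so text near a boundary appears in two chunks.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 300) -> list[str]:
    """Split text into fixed-size character chunks that share chunk_overlap characters.

    Hypothetical helper for illustration; not LangChain's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text,
        # so we do not emit trailing chunks that are pure overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With a larger overlap you index more (redundant) text, so this is a trade-off between recall at chunk boundaries and index size.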


jcutrell 1153 days ago

This is actually pretty insightful - I have done something similar, splitting my Obsidian data into chunks using paragraphs and headers as demarcation, but this solves a more interesting problem of nuance! I like it.


summarity 1153 days ago

If you're interested in improved chunking, I mentioned a few strategies in my talk here (timestamp linked, <1min): https://youtu.be/elNrRU12xRc?t=536 that I used when building https://findsight.ai


cs702 1153 days ago

If you're already splitting documents by paragraph, consider using (as much as possible of) the previous and next paragraphs as overlap.
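
One way this could look, as a rough sketch: split on blank lines, then attach the tail of the previous paragraph and the head of the next one to each chunk. The function name `paragraph_chunks` and the `overlap_chars` budget are assumptions for illustration; in practice you would size the budget to whatever fits your embedding model's limit.

```python
import re


def paragraph_chunks(text: str, overlap_chars: int = 300) -> list[str]:
    """Return one chunk per paragraph, padded with context from its neighbors.

    Hypothetical helper: paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    for i, para in enumerate(paragraphs):
        parts = []
        if i > 0 and overlap_chars > 0:
            # Tail of the previous paragraph as leading context.
            parts.append(paragraphs[i - 1][-overlap_chars:])
        parts.append(para)
        if i + 1 < len(paragraphs) and overlap_chars > 0:
            # Head of the next paragraph as trailing context.
            parts.append(paragraphs[i + 1][:overlap_chars])
        chunks.append("\n\n".join(parts))
    return chunks
```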


sergiotapia 1153 days ago

We did chunks with a sliding window of previous page + current page + next page, with overlaps. That produced the best results.
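
A minimal sketch of that page-level sliding window, assuming the PDF text has already been extracted into a list with one string per page. The name `page_windows` is hypothetical, and this is only one reading of the approach described above, not the commenter's actual code.

```python
def page_windows(pages: list[str], separator: str = "\n") -> list[str]:
    """Build one chunk per page: previous page + current page + next page.

    Hypothetical helper; the first and last pages simply get a shorter window.
    """
    windows = []
    for i in range(len(pages)):
        start = max(0, i - 1)   # clamp at the first page
        end = i + 2             # slice end is exclusive, so this includes page i + 1
        windows.append(separator.join(pages[start:end]))
    return windows
```

Each page ends up in up to three chunks, so the index is roughly three times larger than with one chunk per page.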


chaxor 1153 days ago

This would be much more useful if it used Vicuna or if you could select a different model.


dabedee 1153 days ago

The link to your repo is returning a 404 now, whereas I could see it just a min ago.
