Hacker News
by tracyhenry 1174 days ago
How do you decide what content on the page to index, and how do you split it to fit the context window?

Amazing concept btw - would love to see more examples (like a chatbot for a more well-known site).

2 comments

It's pretty straightforward with LangChain and GPT-Index. There are a lot of tutorials on the Internet for this, like this one: https://youtu.be/9TxEQQyv9cE
I don't think chunking + embedding-based retrieval is good enough. It's a good first draft of a solution, but the chunks are pulled out of their original context, so the LLM could combine them in an unintended way.

Better to question each document separately and then combine the answers in one final LLM round. Even so, inter-document context might be lost, for example when one document depends on details in another. Large collections of documents should be loaded in multiple passes, as the interpretation of a document can change when encountering information in another document. Adding a single piece of information to a collection of documents could slightly change the interpretation of everything; that's how humans incorporate new information.
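
A rough sketch of the "question each document separately, then combine" idea, in Python. Everything here is an assumption for illustration: `ask_llm` is a hypothetical callable you would wire up to whatever model you use, and the prompt wording is made up. It only shows the two-stage shape, not the multi-pass loading.

```python
from typing import Callable, List


def answer_per_document(
    question: str,
    documents: List[str],
    ask_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns the model's text
) -> str:
    """Ask the question against each document on its own, then merge the partial answers."""
    partial_answers = []
    for i, doc in enumerate(documents, start=1):
        # Stage 1: each document is queried in full, so nothing is pulled out of context.
        prompt = (
            f"Document {i}:\n{doc}\n\n"
            f"Question: {question}\n"
            "Answer using only this document."
        )
        partial_answers.append(f"[doc {i}] {ask_llm(prompt)}")

    # Stage 2: one last round that sees every per-document answer at once.
    combine_prompt = (
        "Partial answers, one per document:\n"
        + "\n".join(partial_answers)
        + f"\n\nQuestion: {question}\n"
        "Combine these into one answer and point out any contradictions."
    )
    return ask_llm(combine_prompt)
```

This costs one LLM call per document plus one to combine, so it gets expensive for large collections.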

One interesting application of document-collection based chat bots is finding inconsistencies and contradictions in the source text. If they can do that, they can correctly incorporate new information.

I index everything. I don't pick and choose. Like I said, I do pre-processing to scrape the entire website content.

When the user asks, I try to get the relevant bits and answer the question based on that.
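
For anyone wondering what "get the relevant bits" could look like, here is a minimal sketch in Python. It is not necessarily what I run in production: the function names and the chunk sizes are made up for illustration, and plain word-count cosine similarity stands in for real embeddings so the example needs no external service.

```python
import math
import re
from collections import Counter
from typing import List


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]


def _vectorize(text: str) -> Counter:
    # Bag-of-words counts; a stand-in for an embedding vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(question: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = _vectorize(question)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vectorize(c)), reverse=True)
    return ranked[:k]
```

The selected chunks then go into the prompt together with the user's question. The chunk size should be chosen so that `k` chunks plus the question fit comfortably in the context window.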