by jkukul 1112 days ago
For most applications, packaging all the data and submitting it to OpenAI won't be feasible due to the limited token window size.

I think the most common design pattern nowadays goes like this:

1. Chunk all your data (e.g. per paragraph of content)

2. Generate an embedding for each chunk

3. Index embeddings in a vector database

4. When a query comes in, find the chunks relevant to it (based on embedding similarity) and send ONLY those chunks + the query to an LLM to formulate the answer
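To make the four steps concrete, here is a minimal sketch in plain Python. It is not what the repository does: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is just an in-memory list, and all function names and the sample text are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk(text):
    """Step 1: split on blank lines, giving one chunk per paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def embed(text):
    """Step 2 (toy stand-in): a bag-of-words count vector.
    A real system would call an embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(text):
    """Step 3 (toy stand-in): keep (chunk, vector) pairs in a list.
    A vector database plays this role in a real deployment."""
    return [(c, embed(c)) for c in chunk(text)]


def retrieve(index, query, k=2):
    """Step 4a: rank chunks by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]


def build_prompt(index, query, k=2):
    """Step 4b: send only the retrieved chunks plus the query to the LLM."""
    context = "\n\n".join(retrieve(index, query, k))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"


# Made-up sample data, three paragraphs.
DOCS = """Refunds are issued within 14 days of purchase.

Shipping to Europe takes five business days.

Support is available by email on weekdays."""

index = build_index(DOCS)
prompt = build_prompt(index, "How long does shipping take?", k=1)
print(prompt)
```

The point of the sketch is the shape of the flow: the prompt that reaches the LLM contains only the shipping paragraph, not the whole data set, which is how the pattern works around the token window.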

Quickly glancing through the repository from this post, I can see that it also follows this pattern. It uses OpenAI's embedding API for step 2 and Pinecone DB for step 3.


I've seen this described as the common approach and have argued for it myself, but with my limited knowledge I have difficulty countering the argument that it would be best to just fine-tune the model on your own data.

I don't think it is so much the context window size, because you would chunk your data anyway. I think the counterargument is that fine-tuning is either limited by the risk of overfitting and catastrophic forgetting, or cost-prohibitive. I think it is more the former. Am I on the right track with these arguments?

Another point to consider is probably that the vector DB contains an exact version of your data, so you get that back as a result, whereas the model will only be able to answer vaguely or by paraphrasing.

In contrast, what's the flow for training or fine-tuning your own model?
Once I dug into the fine-tuning APIs [1] I realized that the phrase "training the model on your docs" often doesn't make sense for the use case people are trying to solve. You provide hundreds of input examples and tell the model how it should complete those prompts. Fine-tuning has a lot of use cases, but "keeping the LLM generally grounded in the facts of my website" is not one of them.

[1] https://platform.openai.com/docs/guides/fine-tuning/prepare-...
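For a sense of what "input examples plus the completion you want" looks like, here is a rough sketch that writes training data as JSON Lines. The `prompt` / `completion` field names are my assumption about the layout the fine-tuning guide used at the time, and the examples themselves are invented; check the current docs before relying on either.

```python
import json

# Invented examples: each row pairs a prompt with the completion
# the model should learn to produce for it.
examples = [
    {
        "prompt": "Customer: Where is my order?\nAgent:",
        "completion": " Let me check the tracking details for you.",
    },
    {
        "prompt": "Customer: Can I get a refund?\nAgent:",
        "completion": " Of course, I can start that process now.",
    },
]


def to_jsonl(rows):
    """Serialize one JSON object per line (JSON Lines)."""
    return "\n".join(json.dumps(row) for row in rows)


jsonl = to_jsonl(examples)
print(jsonl)
```

Note that every row teaches a response pattern. Nothing here is a place to hand over a document and have its facts stored for later lookup, which is why this setup fits "answer in this style" better than "know my knowledge base".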

> Fine-tuning has a lot of use cases, but "keeping the LLM generally grounded in the facts of my website" is not one of them.

Yes, that's what everyone says and it makes total sense to me. I'm looking for (technical, but not too technical) arguments for why it is not possible. I'm less interested in the "grounded in the facts of my website" point than in the similar "take the data from my large private knowledge base into consideration" point.

In other words, I don't want to restrict the knowledge the model has or the answers it gives. I want to add a considerable amount of my own knowledge. This does not seem to be possible without training from scratch. The question is "Why?"

Thank you for this.
If you are interested in this approach, I found there are good examples using LangChain, so it's a good keyword to search for.