by ivanstegic 1112 days ago
This is a great idea and would love to see something like this succeed!

If I understand how all of these OpenAI-dependent apps work, none of them actually host the LLM or do any kind of heavy processing themselves. AFAIK, they’re all packaging your data, submitting it to OpenAI on every request, and then repackaging the output. There’s no real indexing and nothing persistent; you have to start from scratch every time. So it’s likely going to be very expensive and super slow.

Or am I wrong and I’ve missed something here?

2 comments

For most applications, packaging all the data and submitting it to OpenAI won't be feasible due to the limited token window size.

I think the most common design pattern nowadays goes like this:

1. Chunk all your data (e.g. per paragraph of content)

2. Generate an embedding for each chunk

3. Index embeddings in a vector database

4. When a query comes in, find chunks relevant to the query (based on embedding similarity) and ONLY send the relevant chunks + query to an LLM to formulate the answer

Quickly glancing through the repository from this post, I can see that it also follows this pattern. It uses OpenAI's embedding API for step 2 and Pinecone for step 3.
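
To make the four steps above concrete, here is a minimal sketch in Python. It is not the code from the repository in this post: the bag-of-words `embed` function is a toy stand-in for a real embedding API, the in-memory list stands in for a vector database, and all names and the sample text are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk(text):
    """Step 1: split the data into paragraph-sized chunks (on blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def embed(text):
    """Step 2 (toy stand-in): a bag-of-words count vector.

    A real system would call an embedding model here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(text):
    """Step 3: store (embedding, chunk) pairs; a list stands in for a vector DB."""
    return [(embed(c), c) for c in chunk(text)]


def retrieve(index, query, k=2):
    """Step 4a: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [c for _, c in ranked[:k]]


def build_prompt(index, query, k=2):
    """Step 4b: only the relevant chunks plus the query go to the LLM."""
    context = "\n\n".join(retrieve(index, query, k))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"


# Hypothetical sample data, three paragraphs separated by blank lines.
DOCS = """Our office is open Monday to Friday from 9 to 5.

Refunds are issued within 14 days of purchase.

Shipping to Europe takes about one week."""

if __name__ == "__main__":
    index = build_index(DOCS)
    print(build_prompt(index, "How long do refunds take?", k=1))
```

The index is built once and reused for every query, so each request only sends a handful of chunks to the LLM rather than the whole data set.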

I've seen this described as the common approach and have argued for it myself, but with my limited knowledge I have difficulty countering the argument that it would be best to just fine-tune the model with your own data.

I don't think it is so much the context window size, because you would chunk your data anyway. I think the counterargument is either that fine-tuning is limited by the risk of overfitting and catastrophic forgetting, or that it is cost prohibitive. I think it is more the former. Am I on the right track with these arguments?

Another point to consider is probably that the vector DB contains an exact version of your data, and you get that back as a result, whereas the model will only be able to answer vaguely or by paraphrasing.

In contrast, what’s the flow for training or fine-tuning your own model?
Once I dug into the fine-tuning APIs [1] I realized that the phrase "training the model on your docs" often doesn't make sense for the use case people are trying to solve. You provide hundreds of input examples and tell the model how it should complete those prompts. Fine-tuning has a lot of use cases, but "keeping the LLM generally grounded in the facts of my website" is not one of them.

[1] https://platform.openai.com/docs/guides/fine-tuning/prepare-...
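
For a sense of what "input examples" means here, below is a rough sketch of preparing fine-tuning data as prompt/completion pairs in JSONL (one JSON object per line). The `prompt`/`completion` layout is my assumption about what the guide described at the time, so check the current docs; the example pairs and the `to_jsonl` helper are made up for illustration.

```python
import json

# Hypothetical training examples: each pair shows the model a prompt and
# the exact completion it should learn to produce for that prompt.
EXAMPLES = [
    ("Classify the sentiment: 'I love this product' ->", " positive"),
    ("Classify the sentiment: 'This broke after a day' ->", " negative"),
    ("Classify the sentiment: 'Works as described' ->", " positive"),
]


def to_jsonl(examples):
    """Serialize (prompt, completion) pairs as JSONL, one object per line."""
    return "\n".join(
        json.dumps({"prompt": prompt, "completion": completion})
        for prompt, completion in examples
    )


if __name__ == "__main__":
    print(to_jsonl(EXAMPLES))
```

Note that each line teaches a pattern of behavior (how to complete a kind of prompt) rather than loading a document into the model, which is why this is a poor fit for "answer from my docs".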

> Fine-tuning has a lot of use cases, but "keeping the LLM generally grounded in the facts of my website" is not one of them.

Yes, that's what everyone says and it makes total sense to me. I'm looking for (technical, but not too technical) arguments for why it is not possible. I'm less interested in the "grounded in the facts of my website" point and more in the similar "take the data from my large private knowledge base into consideration" point.

In other words I don't want to restrict the knowledge the model has or the answers it gives. I want to add a considerable amount of my own knowledge. This seems not to be possible without training from scratch. The question is "Why?"

Thank you for this.
If you are interested in this approach, I found there are good examples using LangChain, so it's a good keyword to search for.
This seems to be mainly a wrapper around the OpenAI API. According to the repo, they want to integrate open-source LLMs in the future too.

Lately I feel that GPT-4 is superb in performance but locked up. Using a weaker model feels better because I can just spin up a server and run it on my own. The recent Twitter/Reddit changes are a reminder that relying on others can be a bad thing.

Mentioned this in a previous reply, but this was something ConvoStack wanted to solve by allowing anyone to integrate their LangChain agent with a production-ready chatbot. It's completely open source and also has pre-built React UI components. As a disclaimer, I helped work on the project, but I'm curious to hear what you guys think: https://github.com/ConvoStack/convostack
Yea, agree.