Hacker News
by dvt 811 days ago
I think you may be misunderstanding what fine-tuning does. It does not teach the model new knowledge. In fact, Meta has a paper out that argues you only need a data set of about 1,000 examples [1] to achieve pretty good alignment (fine-tuning) results. (100M is way overkill.) For knowledge retrieval, you need RAG (usually using the context window).

[1] https://arxiv.org/pdf/2305.11206.pdf

This is not correct. Fine-tuning can absolutely add new knowledge to a model. It's been repeatedly demonstrated at this point.

LIMA demonstrated that instruction-tuning and output formatting could be trained with a limited number of samples, not that finetuning was incapable of adding new information to the model.

It may be sub-optimal compared to RAG in most cases, but it does work.

Do you have any good links to support the idea that this has been repeatedly demonstrated?

I've had trouble finding high quality sources of information about successful applications of fine-tuning to add knowledge to a model.

Here is a recent HN discussion of an article that talks about this. https://news.ycombinator.com/item?id=39748537

Anecdotally, I literally "added knowledge" to a model via fine-tuning earlier today.

Fine-tuning can do extremely well given a specific question and answer: the tuned model "knows" how to answer that question much more accurately.

I gave it a specific question and a good answer as fine-tuning input. (Literally 2 data points as the input, 2 question/answer pairs.)

I asked it that question, and the tuned model blows the base model away, for answering that specific question.

> I asked it that question, and the tuned model blows the base model away, for answering that specific question.

Validating on training data... What could possibly go wrong?

This thread reminds me of a competition I once joined where we were supposed to fine-tune an LLM to fill out trivia answers, and we were expressly disallowed from training on the validation set.

However: we were allowed to pick any base model in a given repo. All of the teams that “won” did so for the same reason: they had all picked the same base model (whereas a majority of teams picked the given default), presumably the one that had at some point been trained on the most favorable data for this particular challenge.

It was quite silly. Had everyone used the same base model, we’d have had a more interesting problem (more around NLP and alignment than picking the ‘best’ model).

Well, in this case we're literally asking whether the model can remember new facts, not generalize, so it seems like a legit first-level test. A second-level test might be: can it answer a broader question that incorporates that specific knowledge?

Our findings are that RAG does not generalize well when critical understanding is spread over a large corpus of information. We do not think it is a question of either context length or retrieval. In our case, it is very clearly a matter of capturing understanding within the model architecture itself.

Does that mean you tested on specific questions? Get 1-5 typical queries and test them with a properly configured LlamaIndex setup.

If your documents repeat the same information several different ways then you actually might get something out of LoRA on raw documents. But you need a way to measure it and you have to verify that RAG won't work with real tests first.

To train effectively with LoRA and expect it to pick up most of the information reliably, though, you need to cover the knowledge and skills with multiple question-answer pairs for each item you expect it to learn. You can then use QA pairs to validate that it learned those things.

But it's a lot of QA pair generation.
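To make the "multiple QA pairs per item, then validate" idea concrete while avoiding the validate-on-training-data trap raised upthread, here is a minimal sketch. Everything in it is an assumption for illustration: the `(fact_id, question, answer)` tuple layout, the function names, and `answer_fn`, which stands in for whatever calls your tuned model. It holds out one phrasing per fact and scores the model only on phrasings it never trained on.

```python
import random
from collections import defaultdict

def split_qa_pairs(qa_pairs, holdout_per_fact=1, seed=0):
    """Split (fact_id, question, answer) tuples into train and held-out sets.

    Each fact keeps at least one phrasing for training, so a fact with a
    single QA pair contributes nothing to the held-out set.
    """
    rng = random.Random(seed)
    by_fact = defaultdict(list)
    for fact_id, question, answer in qa_pairs:
        by_fact[fact_id].append((fact_id, question, answer))

    train, heldout = [], []
    for items in by_fact.values():
        items = items[:]
        rng.shuffle(items)
        n_hold = min(holdout_per_fact, len(items) - 1)
        heldout.extend(items[:n_hold])
        train.extend(items[n_hold:])
    return train, heldout

def exact_match_accuracy(answer_fn, heldout):
    """Fraction of held-out questions answered exactly (case-insensitive).

    answer_fn is a hypothetical callable: question string -> answer string.
    """
    if not heldout:
        return 0.0
    hits = sum(
        answer_fn(question).strip().lower() == answer.strip().lower()
        for _, question, answer in heldout
    )
    return hits / len(heldout)
```

Exact match is a crude metric for free-form answers; it is used here only to keep the sketch short. Running the same held-out questions through a RAG setup gives the baseline comparison suggested above.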

Depending on the application, you would do continued pretraining over new tokens to gain new knowledge. 100M tokens is applicable here.

You would fine-tune, certainly, for domain-specific tasks, and would curate a subset of the 100M tokens. Total tokens in alignment study references is 1,000,000.

RAG is a hacky way to interpolate new knowledge with a base model. Not always reliable nor easy to integrate into task-specific workflows.

When I first played with RAG I thought “wow this is so cool”. Now I’m starting to think it’s kinda useless, in the sense that the critical bit is the initial search, and that doesn’t use the LLM’s power; at most, the LLM is used to capture the user intent and reformulate the query.

We’re building some “smart search” functionality for some teams and I start to wonder if a traditional search results list (i.e. sans the LLM, or used only to rewrite the user query) with the document chunks wouldn’t be better than blindly taking the top N and feeding them to the LLM to produce some response.

E.g. we have some docs about specific supermarket chains in which the word “supermarket” might not appear at all, while the user query might be “show me what we have about supermarkets”. Now the embeddings hopefully will place the word “supermarket” close to, say, “Costco”, but they might also place it closer to “shopping center”, and we might have docs about shopping centers that could rank higher. So we might take the top 5 docs and send them to the LLM, but the docs the user was after might have been in 7th and 9th position, seen by neither the LLM nor the user.
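The top-N cutoff problem in that example can be measured rather than guessed at. Below is a small sketch that computes recall at k over a ranked list. The document ids and similarity scores are made up purely to reproduce the "7th and 9th position" scenario, and the function names are hypothetical.

```python
def top_k(scored_docs, k):
    """Return the ids of the k highest-scoring documents."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def recall_at_k(scored_docs, relevant_ids, k):
    """Fraction of the relevant documents that survive the top-k cutoff."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    retrieved = set(top_k(scored_docs, k))
    return len(retrieved & relevant) / len(relevant)

# Made-up scores: shopping-center docs outrank the supermarket-chain docs.
scored = [
    ("mall_leases", 0.81), ("mall_footfall", 0.80), ("retail_parks", 0.78),
    ("mall_parking", 0.77), ("outlet_centers", 0.75), ("high_street", 0.74),
    ("costco_report", 0.73), ("food_courts", 0.72), ("walmart_report", 0.71),
    ("office_space", 0.40),
]
relevant = ["costco_report", "walmart_report"]

print(recall_at_k(scored, relevant, 5))   # 0.0: the LLM never sees either doc
print(recall_at_k(scored, relevant, 10))  # 1.0: a plain results list shows both
```

Tracking this number on a handful of real queries is one way to decide between feeding the top N to an LLM and showing a longer results list.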

I’ve worked in scaled enterprise search, both with lexical (Lucene-based, e.g. Elasticsearch) & semantic search engines (vector retrieval).

Vector retrieval that isn’t contextualized in the domain is usually bad (RAG solutions call this “naive RAG” … and make up for it with funky chunking and retrieval ensembles). Training custom retrievers and rerankers is often key but quite an effort and still hard to generalize in a domain with broad knowledge.

Lexical search provides nice guarantees and deterministic control over results (depending on how you index). Advanced querying capability is certainly useful here. Constructing/enriching queries with transformers is cool.

Rerankers are often nice additions to an ensemble, and they can be built with smaller models.
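For anyone wondering what a lexical-plus-vector ensemble can look like in practice: the comment above doesn't name a fusion method, so the sketch below swaps in reciprocal rank fusion (RRF) as one common choice. It only needs the ranked id lists from each retriever, not comparable scores. The function name is hypothetical, and `k=60` is just the conventionally used constant, not something from this thread.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by doc id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Toy rankings: one from a lexical engine, one from a vector index.
lexical = ["a", "b", "c"]
vector = ["c", "a", "d"]
print(reciprocal_rank_fusion([lexical, vector]))  # ['a', 'c', 'b', 'd']
```

Documents that both retrievers agree on float to the top, which is the main appeal; a trained reranker can then be applied to the fused list.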

> We’re building some “smart search” functionality for some teams and I start to wonder if a traditional search results list (i.e. sans the LLM, or used only to rewrite the user query) with the document chunks wouldn’t be better than blindly taking the top N and feeding them to the LLM to produce some response.

Yep, it's a pretty common pattern: query -> embeddings -> vector db -> records -> context -> LLM -> result.
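A toy end-to-end version of that pattern, for reference. This is a sketch under heavy assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the "vector db" is a plain list scanned in memory, and `llm` is any callable you pass in (prompt string in, text out). None of these names come from a real library.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call a model here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, records, k=2):
    """query -> embeddings -> 'vector db' scan -> top-k records."""
    query_vec = embed(query)
    ranked = sorted(records, key=lambda r: cosine(query_vec, embed(r)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_records):
    """records -> context: pack the retrieved chunks into the prompt."""
    context = "\n".join(f"- {record}" for record in context_records)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def answer(query, records, llm, k=2):
    """context -> LLM -> result. llm is a hypothetical callable."""
    return llm(build_prompt(query, retrieve(query, records, k)))
```

Note that the LLM only ever sees what `retrieve` returns, so the quality of the final answer is bounded by that step.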

Yes that’s basically the RAG pattern, but I’ve edited my comment to elaborate a bit. I’m questioning what the LLM brings to the table vs just showing the search results (a long list not limited by context length) to the user.

The LLM doesn’t even get the full docs most of the time, just chunks. It has a very narrow view so its full power is not used.

Another approach is to take the user query, have the LLM guess the answer and use that guessed answer for the RAG step.
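A rough sketch of that guess-then-retrieve idea, assuming a hypothetical `guess_fn` that wraps the LLM call and using plain token overlap as a stand-in for embedding similarity. The point is only that the search runs on the guessed answer rather than on the raw query.

```python
def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(text.lower().split())

def overlap(a, b):
    """Jaccard overlap between the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def retrieve_with_guess(query, docs, guess_fn, k=1):
    """Ask the LLM for a guessed answer, then search with that guess.

    guess_fn is a hypothetical callable: query string -> guessed answer.
    """
    guess = guess_fn(query)
    ranked = sorted(docs, key=lambda doc: overlap(guess, doc), reverse=True)
    return ranked[:k]
```

For the supermarket example upthread, a guess like "supermarket chains such as Costco and Walmart" shares vocabulary with the chain-specific docs even though the original query does not. The obvious risk is that a wrong guess steers retrieval toward the wrong documents.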

Question: RAG by definition offloads the retrieval to a vector similarity search via an embeddings db (FAISS, kNN, et al.).

What is the preferred way to feed documents / knowledge into a model so that the primary retrieval is done by the LLM, and perhaps use a vector db only for information enhancement (a la onebox)?