Hacker News
by mcfig 919 days ago
Why not?

(Is this about effectiveness, training time, forgetting, or something else?)


Finetuning means adjusting parameters based on a smaller, specific dataset to tailor the LLM's responses, but the model's underlying knowledge is fixed at the point of its last training update. It adjusts how existing knowledge is used, but doesn't add new facts or information post-training. It's more about tweaking responses, biases, and style, rather than updating its factual database.

RAG combines a language model like GPT with a real-time search component. This allows the model to pull in information from external sources during its response generation process. The model thereby gains the ability to access and integrate recent information that the language model alone wouldn't have.
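To make the retrieve-then-generate idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `DOCS` list stands in for an external source, word overlap stands in for a real search component, and `build_prompt` only assembles the text a language model would be given. No actual model is called.

```python
import re
from collections import Counter

# Hypothetical mini knowledge base; a real system would query a search index.
DOCS = [
    "The 2024 release added streaming support to the API.",
    "Fine-tuning adjusts model weights on a small dataset.",
    "Our office is closed on public holidays.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, docs, k=1):
    """Return the k docs sharing the most words with the query."""
    q = Counter(tokenize(query))

    def overlap(doc):
        return sum((q & Counter(tokenize(doc))).values())

    return sorted(docs, key=overlap, reverse=True)[:k]

def build_prompt(query, docs):
    """Prepend the retrieved text so the model can use facts it was never trained on."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What did the 2024 release add?", DOCS)
```

The point of the sketch is only that the new fact reaches the model through the prompt at inference time, not through its weights.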

This is splitting hairs, and pragmatically speaking not wholly accurate. It may even be completely incorrect, depending on your definitions.

You can think of an LLM as a set of basis vectors for human knowledge. If I feed in a PR training manual that is not in its dataset, it nevertheless figures out “hey I can make a reasonable approximation of this by combining X, Y, and Z” where X, Y, and Z are things it learned from its training set. In other words it maps the input into a vector representation based on its training data.

But in linear algebra two mappings can represent the same vector, just using different bases, so long as the vector spaces spanned by the two bases are equal (or at least one is a subspace of the other). That's essentially what's going on here. An LLM builds a vector space on top of all human knowledge. If its parameters and training set are large enough, then the basis is in fact sufficient for representing anything you might throw at it. It will represent it in terms of its training set, yes, but that representation is high fidelity enough to represent the document in its entirety.
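The linear algebra claim is easy to check in 2D. This is a small sketch with made-up numbers (the vector `v` and both bases are assumptions chosen for illustration): the same vector gets different coordinates in two bases that span the same plane, and both sets of coordinates reconstruct it exactly.

```python
def coords_in_basis(v, b1, b2):
    """Solve a*b1 + b*b2 = v for (a, b) using Cramer's rule (2D only)."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are not linearly independent")
    a = (v[0] * b2[1] - b2[0] * v[1]) / det
    b = (b1[0] * v[1] - v[0] * b1[1]) / det
    return a, b

def reconstruct(coords, b1, b2):
    """Rebuild the vector from its coordinates in the basis (b1, b2)."""
    a, b = coords
    return (a * b1[0] + b * b2[0], a * b1[1] + b * b2[1])

v = (3.0, 1.0)
standard = ((1.0, 0.0), (0.0, 1.0))   # x-axis and y-axis
diagonal = ((1.0, 1.0), (1.0, -1.0))  # the two diagonals

c_std = coords_in_basis(v, *standard)   # (3.0, 1.0)
c_diag = coords_in_basis(v, *diagonal)  # (2.0, 1.0), since 2*(1,1) + 1*(1,-1) = (3,1)
```

Different coordinates, same vector. Whether an LLM's learned representation really behaves like a complete basis is the part under debate in this thread; the code only shows the textbook fact.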

Fine-tuning a model is essentially rebalancing the initial weights of the LLM to pay special attention to certain clusters in its vector space, represented by the fine-tuning data. It's as if I threw random 2D points at a machine learning algorithm and it learned the basis { (1, 0), (0, 1) } representing the x-axis and y-axis. As a consequence of how inference works, when asked to generate points it may then end up preferring points that are nearer to one axis or the other.

But then I fine-tune it on points that are distributed along the diagonal. These points are not representative of the original training data, but they are NOT "outside" it. They are fully represented by a linear combination of the x- and y-axis basis vectors. Nevertheless, the fine-tuning trains the model to prefer points whose coordinates in the original model are multiples of (1, 1) or (1, -1). In other words, points along the diagonals.
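A toy version of that analogy, with loud caveats: the "model" here is just three second-moment statistics of the points it has seen, and "fine-tuning" is a blend of old and new statistics at a made-up `rate`. That is an assumption for illustration, not how LLM fine-tuning works. The data points are invented too.

```python
def second_moments(points):
    """Uncentered second moments (E[x*x], E[y*y], E[x*y]) of 2D points."""
    n = len(points)
    sxx = sum(x * x for x, _ in points) / n
    syy = sum(y * y for _, y in points) / n
    sxy = sum(x * y for x, y in points) / n
    return sxx, syy, sxy

def blend(old, new, rate):
    """Move the old statistics part of the way toward the new ones."""
    return tuple((1 - rate) * o + rate * n for o, n in zip(old, new))

def diagonal_preference(moments):
    """Uncentered x/y correlation: 0 means no diagonal preference, 1 means all on the (1, 1) line."""
    sxx, syy, sxy = moments
    return sxy / (sxx * syy) ** 0.5

pretrain = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # axis-aligned points
finetune = [(1, 1), (2, 2), (-1, -1)]          # points on the (1, 1) diagonal

base = second_moments(pretrain)
tuned = blend(base, second_moments(finetune), rate=0.5)
```

Every fine-tuning point is still just x*(1, 0) + y*(0, 1), so nothing new had to be added to the representation. Only the statistics shifted: `diagonal_preference(base)` is 0 and `diagonal_preference(tuned)` is about 0.8.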

Pragmatically speaking, this is no different from doing a whole new training run on the diagonal points, except that it is much, much cheaper, and has the capacity to reuse whatever knowledge was learned in the first training run.

  "...all human knowledge..."
  
I am assuming in practice it was not trained on "ALL human knowledge", particularly in the case of our private company knowledgebase.
It's a big presumption to assume that your private company knowledge base contains such unique information that it is positively unrepresentable using basis vectors derived from the terabytes of public domain data sets that went into training the LLM.
Agreed. But that has been my experience. Almost everything in this KB goes against conventional wisdom.