by adastra22 918 days ago
This is splitting hairs, and pragmatically speaking not wholly accurate. It may even be completely incorrect, depending on your definitions.

You can think of an LLM as a set of basis vectors for human knowledge. If I feed in a PR training manual that is not in its dataset, it nevertheless figures out “hey, I can make a reasonable approximation of this by combining X, Y, and Z” where X, Y, and Z are things it learned from its training set. In other words, it maps the input into a vector representation based on its training data.

But in linear algebra the same vector can be written in two different bases, so long as both bases span the same vector space (or, if one spans only a subspace of the other, the vector lies in that subspace). That's essentially what's going on here. An LLM builds a vector space on top of all human knowledge. If its parameter count and training set are large enough, then the basis is in fact sufficient for representing anything you might throw at it. It will represent it in terms of its training set, yes, but that representation is high fidelity enough to represent the document in its entirety.
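
To make the change-of-basis point concrete, here is a minimal sketch in plain Python. The helper names `coords_in` and `combine` are made up for this illustration, and it only covers the 2D case. It writes one vector in the standard basis and in a diagonal basis, then rebuilds the same vector from either set of coordinates.

```python
def coords_in(v, basis):
    """Solve a*u + b*w = v for (a, b) using Cramer's rule (2D only)."""
    u, w = basis
    det = u[0] * w[1] - w[0] * u[1]
    if det == 0:
        raise ValueError("basis vectors are linearly dependent")
    a = (v[0] * w[1] - w[0] * v[1]) / det
    b = (u[0] * v[1] - v[0] * u[1]) / det
    return (a, b)


def combine(coords, basis):
    """Rebuild the vector from its coordinates in the given basis."""
    (a, b), (u, w) = coords, basis
    return (a * u[0] + b * w[0], a * u[1] + b * w[1])


v = (3.0, 1.0)
standard = ((1.0, 0.0), (0.0, 1.0))    # x-axis and y-axis
diagonal = ((1.0, 1.0), (1.0, -1.0))   # the two diagonals

in_standard = coords_in(v, standard)   # (3.0, 1.0)
in_diagonal = coords_in(v, diagonal)   # (2.0, 1.0)

# Different coordinates, same underlying vector.
same_vector = combine(in_standard, standard) == combine(in_diagonal, diagonal)
```

The coordinates differ, but nothing about the vector itself is lost or added by switching bases, which is the sense of "representable" I mean above.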

Fine-tuning a model is essentially rebalancing the initial weights of the LLM to pay special attention to certain clusters in its vector space, represented by the fine-tuning data. It's as if I threw random 2D points at a machine learning algorithm and it learned the basis { (1, 0), (0, 1) } representing the x-axis and y-axis. As a consequence of how inference works, it may then end up preferring, when asked to generate points, ones that are nearer to one axis or the other.

But then I fine-tune it on points that are distributed along the diagonal. These points are not representative of the original training data, but they are NOT "outside" the original data either. They are fully represented by a linear combination of the x- and y-basis vectors. Nevertheless, the fine-tuning trains the model to prefer points which have weights that are multiples of (1, 1) or (1, -1) when represented in the original model. In other words, points along the diagonals.

Pragmatically speaking, this is no different from doing a whole new training run on the diagonal points, except that it is much, much cheaper, and has the capacity to reuse whatever knowledge was learned in the first training run.
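
A loose toy version of this, in plain Python. It is an assumption-heavy sketch and not how LLM fine-tuning actually works: the "model" is a single weight `w` in `y = w * x`, and `gd_fit` is a hypothetical helper that runs plain gradient descent on squared error. "Pretraining" sees points along the x-axis, and "fine-tuning" simply continues the same optimization from the pretrained weight on diagonal points.

```python
def gd_fit(w, points, lr=0.1, epochs=200):
    """Fit y ~ w * x by gradient descent on mean squared error, starting from w."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= lr * grad
    return w


xs = (-2.0, -1.0, 1.0, 2.0)
axis_points = [(x, 0.0) for x in xs]    # points along the x-axis
diagonal_points = [(x, x) for x in xs]  # points along the diagonal y = x

# Every diagonal point is still x*(1, 0) + x*(0, 1): inside the original span.
in_span = all((x * 1 + y * 0, x * 0 + y * 1) == (x, y) for x, y in diagonal_points)

w_pretrained = gd_fit(0.5, axis_points)                          # arbitrary init
w_finetuned = gd_fit(w_pretrained, diagonal_points, epochs=20)   # short fine-tune
w_scratch = gd_fit(0.5, diagonal_points)                         # full new run
```

Here `w_pretrained` settles near 0 (the x-axis) while `w_finetuned` and `w_scratch` both settle near 1 (the diagonal). The same parameter space covers both data sets; only the weight moved. This toy is too small to show the cost savings or knowledge reuse, so take it as an illustration of the "rebalancing" idea only.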

1 comments

  "...all human knowledge..."
  
I am assuming in practice it was not trained on "ALL human knowledge", particularly in the case of our private company knowledgebase.
It's a big presumption to assume that your private company knowledgebase contains such unique information that it is positively unrepresentable using basis vectors derived from the terabytes of public domain data sets that went into training the LLM.
Agreed. But that has been my experience. Almost everything in this KB goes against conventional wisdom.