by PeterisP 971 days ago
The general idea is that you have a particular task & dataset, and you optimize these vectors to maximize performance on that task. So the properties of these vectors - what information is retained and what is left out during the 'compression' - are effectively determined by that task.

In general, the core task used to train the various "LLM tools" is prediction of a hidden word in very large quantities of real text, so the resulting vectors also mirror whatever structure (linguistic, syntactic, semantic, factual, social bias, etc) exists there.
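
To make the "hidden word" task concrete, here is a rough sketch of how training examples could be carved out of raw text. The function name masked_examples, the "[MASK]" placeholder and the window size of 2 are all just illustrative choices of mine; real systems tokenize and sample far more carefully.

```python
def masked_examples(tokens, window=2, mask="[MASK]"):
    """Build one (context with hidden word, hidden word) pair per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Replace the word at position i with the placeholder and keep
        # up to `window` neighbours on each side.
        context = tokens[lo:i] + [mask] + tokens[i + 1:hi]
        examples.append((context, target))
    return examples


tokens = "the cat sat on the mat".split()
pairs = masked_examples(tokens)
for context, target in pairs:
    print(" ".join(context), "->", target)
```

The model never gets labels from a human here: the text itself supplies both the question (the context) and the answer (the hidden word), which is why it can be trained on very large quantities of text.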

If you want to see how the sausage is made and look at the actual algorithms, the two key approaches to read up on would probably be Mikolov's word2vec (https://arxiv.org/abs/1301.3781), with its CBOW (Continuous Bag of Words) and Continuous Skip-gram models, which are based on relatively simple math optimization, and then BERT (https://arxiv.org/abs/1810.04805), which does a conceptually similar thing but with a large neural network that can learn more from the same data. For both of them, you can either read the original papers or look up blog posts or videos that explain them; different people have different preferences on how readable academic papers are.
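
For a feel of what "relatively simple math optimization" means, below is a toy CBOW-style sketch in plain Python. It is not the word2vec implementation: it uses a full softmax over a tiny vocabulary instead of the hierarchical softmax from the paper, and the function name train_cbow, the toy corpus and all hyperparameters (dim, window, epochs, lr) are assumptions made for illustration.

```python
import math
import random


def train_cbow(tokens, dim=8, window=2, epochs=200, lr=0.1, seed=0):
    """Toy CBOW: predict each word from the average of its context vectors."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index[w] for w in tokens]
    size = len(vocab)
    # Input vectors are the embeddings we keep; output vectors score candidates.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for i, target in enumerate(ids):
            context = ids[max(0, i - window):i] + ids[i + 1:i + window + 1]
            if not context:
                continue
            # Hidden representation: mean of the context word vectors.
            h = [sum(w_in[c][d] for c in context) / len(context)
                 for d in range(dim)]
            # Full softmax over the vocabulary (an assumed simplification).
            scores = [sum(w_out[v][d] * h[d] for d in range(dim))
                      for v in range(size)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            norm = sum(exps)
            probs = [e / norm for e in exps]
            total += -math.log(probs[target])
            # Gradient of cross-entropy w.r.t. scores: probs minus one-hot.
            err = [p - (1.0 if v == target else 0.0)
                   for v, p in enumerate(probs)]
            grad_h = [sum(err[v] * w_out[v][d] for v in range(size))
                      for d in range(dim)]
            for v in range(size):
                for d in range(dim):
                    w_out[v][d] -= lr * err[v] * h[d]
            for c in context:
                for d in range(dim):
                    w_in[c][d] -= lr * grad_h[d] / len(context)
        losses.append(total / len(ids))
    return vocab, w_in, losses


corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab, vectors, losses = train_cbow(corpus)
print("average loss, first epoch:", round(losses[0], 3))
print("average loss, last epoch:", round(losses[-1], 3))
```

The only training signal is "guess the hidden word from its neighbours", and the rows of w_in are the word vectors that fall out of it. BERT keeps the same kind of objective but swaps the averaging step for a deep transformer network.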