| HN Mirror

by 343242dfsdf 713 days ago

>the data is the limit

Most people don't know exactly how the dataset is "fed" to the training pipeline, but with the current state of the art you can say the feeding is like a human reading a text aloud once, never re-reading it, not even a single word.

And then you're asked which words you read most often, what order they came in, and how many times each appeared in the text. Then, from the numbers you just gave, some probabilities are calculated and annotated, and there you get a "token".
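
To make the metaphor concrete, here's a minimal Python sketch of a toy bigram counter. This is only an illustration of the analogy, not how real training pipelines work: it reads each word exactly once, tallies counts and word order, and turns the tallies into next-word probabilities. The names `read_once` and `next_word_probs` are made up for this example.

```python
from collections import Counter, defaultdict


def read_once(words):
    """Single pass over the text: count each word and each (previous, next) pair."""
    word_counts = Counter()
    pair_counts = defaultdict(Counter)
    prev = None
    for w in words:  # every word is visited exactly once, never re-read
        word_counts[w] += 1
        if prev is not None:
            pair_counts[prev][w] += 1  # remember which word followed which
        prev = w
    return word_counts, pair_counts


def next_word_probs(pair_counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = pair_counts[word]
    total = sum(followers.values())
    if total == 0:
        return {}
    return {nxt: n / total for nxt, n in followers.items()}


text = "the cat sat on the mat the cat ran"
word_counts, pair_counts = read_once(text.split())
probs = next_word_probs(pair_counts, "the")
print(word_counts.most_common(2))
print(probs)
```

Here "the" is followed by "cat" twice and "mat" once, so the toy model assigns "cat" twice the probability of "mat" after "the". Real models learn far richer statistics than this, so take it as a sketch of the analogy only.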

There are obvious improvements that could plausibly be applied to that basic processing, and most are being applied already, but apparently there's still plenty of room for further improvement.

Claude says the text above could be described as "a simplified metaphor of model training", so be warned that it's a simplification.