| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by stormfather 609 days ago

Do they ever do something like the following scenario?:

1. LLM is trained on everything 2. LLM classifies everything in training corpus as high / low quality 3. New (or same) LLM (re)trains on only high quality documents

I've read most web data is somewhat to absolutely useless, e.g. pages of stock quotes, and it seems easy for something like GPT-3 to classify that, and classifying it would take what... one extra epoch's worth of computation? And save much more computation downstream by shrinking the size of the training set.