Hacker News new | ask | show | jobs
by icyfox 240 days ago
Yes - garbage in / garbage out still holds true for most things when it comes to LLM training.

The two bits about this paper that I think are worth calling out specifically:

- A reasonable amount of post-training can't save you when your pretraining comes from a bad pipeline; ie. even if the syntactics of the input pretrained data are legitimate it has learned some bad implicit behavior (thought skipping)

- Trying to classify "bad data" is itself a nontrivial problem. Here the heuristic approach of engagement actually proved more reliable than an LLM classification of the content

1 comments

Yes but the other interesting bit which is not clearly addressed is that increasing the garbage in to 100% does not result in absolute garbage out. So visibly there is still something to learn there.