Hacker News
by pistachiopro 846 days ago
LLMs are being trained on a smaller and smaller percentage of human prose. Right now it seems like code is the best source for the bulk of an LLM's diet, but it's also looking likely that synthetic math text will be even better. The structured reasoning of code and math seems to be what actually makes these big LLMs "smart." Once you've trained a smart LLM, it seems to take a relatively small amount of hand-curated human prose to fine-tune it into talking like a human. Unfortunately, this article feels like the wishful thinking of someone who is afraid of the changes LLMs are bringing and hasn't done much research.
1 comment

It seems to me like archive.org and the major book publishers are sitting on a gold mine (at least up to 2022), but I haven't seen anyone else saying the same, so maybe I just don't know enough about LLMs.