| HN Mirror

I disagree here: Setting up a large-scale pretraining run is super complex if you have to manage your distributed computing platform, but looking at how the training data looks like and is fed into an LLM is not that complex. If you are developing a product based on or with LLMs, it's worth spending a few hours to understand it on the big-picture level. I mean, look at how many people are confused why LLMs a) hallucinate facts, b) sometimes copy text passages verbatim, c) why they probably shouldn't be used as scientific calculators etc. All that could be much more clear if you know how they are trained.