Yes, it requires an extremely large, diverse training set for the first, unsupervised stage (pre-training). Then you fine-tune it on the smaller dataset. But we may need to wait for the next generation of LLMs that incorporate planning algorithms, so that they can better stay focused on their goal in whatever research tasks we give them. Otherwise we end up with this: https://news.ycombinator.com/item?id=38418974