| HN Mirror



	by iopq 941 days ago
	https://en.wikipedia.org/wiki/Corpus_Inscriptionum_Latinarum contains approximately 180,000 inscriptions.

1 comments

xcv123 941 days ago

That only contains a few million tokens. Useless for pre-training an LLM from scratch. You would need to find billions of tokens.
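A rough back-of-envelope sketch of that estimate, assuming an average of about 20 tokens per inscription. That figure is a hypothetical placeholder, not a measured value; only the 180,000 count comes from the link above.

```python
# Back-of-envelope corpus size estimate.
INSCRIPTIONS = 180_000        # approximate count quoted from the Wikipedia link
TOKENS_PER_INSCRIPTION = 20   # ASSUMPTION: hypothetical average, not measured


def estimate_tokens(n_docs: int, tokens_per_doc: float) -> int:
    """Return a rough corpus size in tokens."""
    return round(n_docs * tokens_per_doc)


corpus_tokens = estimate_tokens(INSCRIPTIONS, TOKENS_PER_INSCRIPTION)
# How many times larger a one-billion-token corpus would be.
shortfall = 1_000_000_000 / corpus_tokens

print(f"{corpus_tokens:,} tokens, roughly {shortfall:.0f}x short of one billion")
```

Even if the real average were several times higher, the total would stay in the millions, well short of billions.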


iopq 941 days ago

It should be similar to this 1700s English model: probably trained on modern data to start, and then fitted to the smaller data set at the end.


xcv123 940 days ago

Yes, it requires an extremely large, diverse training set for the first unsupervised stage (pre-training). Then you fine-tune it on the smaller data set. But we may need to wait for the next generation of LLMs that incorporate planning algorithms, so that they can stay better focused on the goal of whatever research tasks we ask of them. Otherwise we end up with this: https://news.ycombinator.com/item?id=38418974
