

by Dorialexander 977 days ago

Yes. I think we may have enough for "full finetuning", erasing the previous knowledge to a large extent. But that's still very far short of what pretraining needs.

"RomeGPT" is next on my list of Monad successors. To give you a general idea, we have on the order of tens of millions of words in classical Latin (and the biggest source will be… Augustine). There was a BERT Latin project that was able to collect roughly 500 million words in all, mostly early modern and modern Latin.

In comparison, I'm currently part of a project to pretrain a French model and we need… 140 billion words.
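
To put those orders of magnitude side by side, here is a rough back-of-the-envelope sketch. The 500 million and 140 billion figures are the ones quoted above; the 10 to 90 million range for classical Latin is only my reading of "tens of millions", and the names are made up for illustration.

```python
# Word counts quoted in the comment above.
ALL_LATIN_BERT = 500_000_000          # BERT Latin project, all periods
FRENCH_PRETRAINING = 140_000_000_000  # target for the French pretraining run

# ASSUMPTION: "tens of millions" is read as anywhere from 10M to 90M words.
CLASSICAL_LATIN_RANGE = (10_000_000, 90_000_000)


def shortfall(target: int, available: int) -> float:
    """How many times larger the target corpus is than what is available."""
    return target / available


bert_ratio = shortfall(FRENCH_PRETRAINING, ALL_LATIN_BERT)
classical_low = shortfall(FRENCH_PRETRAINING, CLASSICAL_LATIN_RANGE[1])
classical_high = shortfall(FRENCH_PRETRAINING, CLASSICAL_LATIN_RANGE[0])

print(f"vs. all collected Latin: {bert_ratio:,.0f}x")
print(f"vs. classical Latin: {classical_low:,.0f}x to {classical_high:,.0f}x")
```

So even the full 500 million word collection is a few hundred times smaller than the French target, and classical Latin alone is off by three to four orders of magnitude under that assumed range.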