Hacker News
by wongarsu 1248 days ago
And we have about a million cuneiform tablets, with maybe 10-100 words each, so we have somewhere between ten and a hundred million words of text to fine-tune the model with. Or maybe to use when training the next model; after all, GPT can already speak multiple languages.

Addendum: One possible challenge is that so far large language models have been trained on a large sample of all text that has been published, while what we have of cuneiform is a decent sample of all text that was written. That means most cuneiform tablets are inventories, invoices, requests for payment, contracts, tablets from students practicing writing, etc. These are types of documents that are underrepresented in traditional training data.