|
|
|
|
|
by microtonal
2018 days ago
|
|
That's only for pretraining a model. Very few groups pretrain models, since it is so expensive in terms of GPU time. Finetuning a model for a specific task typically only takes a few hours. E.g. I regularly train multitask syntax models (POS tagging, lemmatization, morphological tagging, dependency relations, topological fields), which takes just a few hours on a consumer-level RTX 2060 super. Unfortuntaly, distillation of smaller models can take a fair bit of time. However, there is a lot of recent work to make distillation more efficient, e.g. by not just training on the label distributions of of a teacher model, but by also learning to emulate the teacher's attention, hidden layer outputs, etc. |
|