| HN Mirror



	by microtonal 2064 days ago
	That's only for pretraining a model. Very few groups pretrain models, since it is so expensive in terms of GPU time. Finetuning a model for a specific task typically takes only a few hours. E.g. I regularly train multitask syntax models (POS tagging, lemmatization, morphological tagging, dependency relations, topological fields), which takes just a few hours on a consumer-level RTX 2060 Super. Unfortunately, distilling smaller models can take a fair bit of time. However, there is a lot of recent work on making distillation more efficient, e.g. by not just training on the label distributions of a teacher model, but by also learning to emulate the teacher's attention, hidden layer outputs, etc.
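	To make the distillation idea concrete, here is a rough sketch of a combined objective: a soft-label term that matches the teacher's label distribution, plus a term that pulls the student's hidden outputs toward the teacher's. It is plain Python for readability, not what any particular paper or library does. The function names, the temperature of 2.0, and `hidden_weight` are all illustrative assumptions, and it assumes the student and teacher hidden vectors already have the same size (in practice a learned projection is usually needed). Attention maps could be matched with another term of the same shape.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, hidden_weight=1.0):
    """Hypothetical combined loss: soft labels plus hidden-output matching."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The temperature**2 factor is the usual rescaling for softened targets.
    soft_loss = kl_divergence(p_teacher, p_student) * temperature ** 2
    hidden_loss = mse(student_hidden, teacher_hidden)
    return soft_loss + hidden_weight * hidden_loss
```

	When the student reproduces the teacher exactly, both terms are zero. The extra hidden-output term gives the student a training signal at every layer position you choose to match, rather than only at the output labels.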

1 comments

ericd 2064 days ago

Is there one model that you use more frequently than others as a base for these disparate fine-tuning tasks? Basically, are there any that are particularly flexible?


ma2rten 2064 days ago

In general, BERT would be the most common one. RoBERTa is the same model but trained for longer, which turns out to work better. T5 is a larger model, which works better on many tasks but is more expensive.


ericd 2063 days ago

Thanks for the summary! I'm familiar with BERT, but less so the different variants, so that's quite helpful. I'll take a look at how RoBERTa works.


microtonal 2064 days ago

So far, of the models that run on GPUs with 8-16 GiB of VRAM, XLM-RoBERTa has been the best for these specific tasks. It worked better than the multilingual BERT model and language-specific BERT models by quite a wide margin.


ericd 2063 days ago

Great, thanks very much for the pointer, especially the VRAM context - I'm looking to fine-tune on 2080 Tis rather than V100s/A100s, so that's really good to know.
