HN Mirror



	by leobg 300 days ago
	If I have 1,000 labeled examples for a classification task, I’ll expand that into a training dataset using augmentation, and then finetune a small model like RoBERTa. It’s fast, cheap, accurate — and predictable. Others have had success with SetFit as the training framework and Ettin as the base model.

1 comments

coder68 300 days ago

Oh, this seems like an interesting idea. What tactics do you use for augmentation? For my own use case, I think I could reorder semantic chunks, or maybe randomly delete pieces, but I'm curious what you do!

I have also considered training a small language model for synthetic data generation.
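
The reordering and deletion ideas above could be sketched roughly as follows. This is only an illustration, not anyone's actual pipeline: the function name `augment_chunks`, the `drop_prob` parameter, and the sample chunks are all made up, and it assumes the label does not depend on chunk order or on any single chunk.

```python
import random

def augment_chunks(chunks, rng, drop_prob=0.2, shuffle=True):
    """Return a variant of `chunks` with some dropped and the rest optionally reordered.

    Hypothetical helper: assumes the class label survives both operations.
    """
    # Keep each chunk independently with probability 1 - drop_prob.
    kept = [c for c in chunks if rng.random() >= drop_prob]
    if not kept:
        # Never return an empty example; fall back to one random chunk.
        kept = [rng.choice(chunks)]
    if shuffle:
        rng.shuffle(kept)
    return kept

rng = random.Random(0)  # fixed seed so runs are repeatable
chunks = ["intro paragraph", "pricing details", "contact info", "legal footer"]
variants = [" ".join(augment_chunks(chunks, rng)) for _ in range(5)]
```

Each variant keeps the original label. Whether shuffling is safe depends on the task: it is probably fine for topic classification, and probably not for anything where order carries meaning.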


leobg 296 days ago

Yes, exactly. You want to randomize the parts that are irrelevant. For example, if you're classifying news articles, you may want to shorten them anyway. A human would be able to tell what category an article belongs to without reading the whole thing, so you can use a combination of URL, headline, beginning, middle, and/or end. And if you do that, it's easy to turn one training example into 10 or more: you just vary the length of the individual parts.
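
A minimal sketch of that idea in Python, under some assumptions of mine: the function name `make_views`, the word-based lengths, the inclusion probabilities, and the example article are all invented for illustration. The label stays the same for every view.

```python
import random

def make_views(url, headline, body, n_views=10, max_part_words=30, seed=0):
    """Turn one labeled article into several shortened training texts.

    Hypothetical helper: each view combines URL, headline, and
    beginning/middle/end excerpts of randomly varying length.
    """
    rng = random.Random(seed)
    words = body.split()
    mid = len(words) // 2
    views = []
    for _ in range(n_views):
        # Vary the length of each excerpt independently (0 means "skip it").
        n_begin = rng.randint(0, max_part_words)
        n_mid = rng.randint(0, max_part_words)
        n_end = rng.randint(0, max_part_words)
        start = max(0, mid - n_mid // 2)
        end_start = max(0, len(words) - n_end)
        parts = [
            url if rng.random() < 0.5 else "",
            headline if rng.random() < 0.8 else "",
            " ".join(words[:n_begin]),
            " ".join(words[start:start + n_mid]),
            " ".join(words[end_start:]) if n_end else "",
        ]
        # Excerpts may overlap on short articles; that is acceptable here.
        text = "\n".join(p for p in parts if p)
        views.append(text or headline)  # never emit an empty example
    return views

body = " ".join(f"w{i}" for i in range(200))  # stand-in for real article text
views = make_views("https://example.com/sports/a", "Team wins final", body)
```

With `n_views=10` this yields ten texts for the one label, which matches the "one example into 10 or more" ratio described above. Character or token budgets would work just as well as word counts.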
