Hacker News new | ask | show | jobs
by leobg 300 days ago
If I have 1,000 labeled examples for a classification task, I’ll expand that into a training dataset using augmentation, and then finetune a small model like RoBERTa. It’s fast, cheap, accurate — and predictable.

Others have had success with SetFit as the training framework and Ettin as the base model.

1 comments

oh this seems like an interesting idea, what tactics do you use for augmentation? For my own use-case, I think I could reorder semantic chunks, or maybe randomly delete pieces, but curious what tactics you use!

I have also considered training a small language model for synthetic data generation.

Yes, exactly. You want to randomize the parts that are irrelevant. For example, if you're classifying news articles, you may want to shorten them anyway. A human would be able to tell what category an article belongs to without reading the whole thing - so may do a combination of URL, headline, beginning, middle, and/or end. And if you do that, it's easy to turn one training example into 10 or more. You just vary the length of the individual parts.