Hacker News
by 317070 462 days ago
I've been fine-tuning these models since before ChatGPT, and the one lesson I've learned is that by the time you have set up everything to fine-tune a model, you can expect a newer model to do as well with prompt-tuning.

So unless you hope to stay at the forefront (e.g. to be ahead of competitors), there has been no real reason to fine-tune for the last 4 years. At best you could hope to stay about 1-3 months ahead, depending on how fast you were at setting up your training. And if that is what you did hope to achieve, you needed to automate at a higher level, i.e. automate data collection and the collection of eval cases.

3 comments

It feels like there should be a service where I just drag and drop a folder of examples and it fine-tunes the latest DeepSeek or whatever for me and can even host it for me at some cost. I'd pay for that immediately, but last I checked there was nothing that really did that well (would love to be wrong).
There are some options out there, depending on what type of task you're trying to fine-tune for. I think RL fine-tuning for DeepSeek, for example, isn't well developed yet, but you can fine-tune a small Llama model (~3B params) for classification or extraction tasks and it works really well. What sort of tasks were you looking at fine-tuning for?
Code generation or question answering. But ideally 70+B
Vibe coding has taken over for frontend dev, but outside that narrow band of very visible coding, most models aren't great at more esoteric programming languages. Even Swift gives Claude trouble. So the reason to fine-tune is simply that the best newest models still remain bad at things outside their comfort zone (how human).
I take my quip both ways, so I would wager that even with finetuning, these models are only 1 generation ahead in esoteric language performance and therefore _still not very good_. Am I correct?
Wanting it to be bad reeks of copium.
Why would I want it to be bad? I'm afraid I don't understand what you mean.
You wrote, emphatically, that it would be "still not very good". Why do you believe that it would still not be very good after training on a specific problem? LLMs aren't able to do things outside their training data, as vast as it is, but if something is in its training data, why are you emphatic that it's still not very good? If I ask it to make something for which it only needs to copy out sample code, I'd expect it to be pretty good at that one very specific task.
I feel like this is true, but it would be great if you could provide examples so we could get a better idea of why you think/know this.
I work for DeepMind on Project Astra. Without delving too deep into confidential details of the capabilities I have been looking at, it has been the theme since the Flamingo model that you only gain about 1 model generation by fine-tuning versus prompt-tuning.