Hacker News
by gdiamos 247 days ago
Having just come out of founding one of the first LLM fine tuning startups, Lamini, I disagree.

Our thesis was that fine tuning would be easier than deep learning for users to adopt because it was starting from a very capable base LLM rather than starting from scratch

However, our main finding with over 20 deployments was that LLM fine tuning is no easier to use than deep learning

The current market situation is that ML engineers who are good enough at deep learning to master fine tuning can found their own AI startup or join Anthropic/OpenAI. They are underpaid building LLM solutions. Expert teams building Claude, GPT, and Qwen will outcompete most users who try fine tuning on their own.

RAG, prompt engineering, inference time compute, agents, memory, and SLMs are much easier to use and go very far for most new solutions


Will Anthropic/OpenAI really hire anyone who can fine-tune an LLM?
They will hire anyone who can produce a model better than GPT5, which is the bar for fine tuning

Otherwise, you should just use GPT5.

Preparing a few thousand training examples and pressing fine tune can improve the base LLM in a few situations, but it can also make the LLM worse at other tasks in hard-to-understand ways that only show up in production, because you didn’t build evals good enough to catch them. It also has all of the failure modes of deep learning. There is a reason why deep learning training never took off like LLMs did, despite many attempts at building startups around it.
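To make the "evals good enough to catch them" point concrete, here is a minimal sketch of a per-task regression check between a base model and a fine-tuned one. Everything in it is hypothetical: `base_model` and `tuned_model` are toy lookup tables standing in for real model calls, the task names and examples are made up, and exact-match scoring is an assumption that would be too crude for most real tasks.

```python
from collections import defaultdict


def accuracy_by_task(predict, examples):
    """Return {task: accuracy} for a predict(prompt) -> str callable."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prompt, expected in examples:
        total[task] += 1
        # Exact-match scoring: an assumption, real evals usually need more.
        if predict(prompt).strip() == expected:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def find_regressions(base_predict, tuned_predict, examples, tolerance=0.0):
    """Return {task: (base_acc, tuned_acc)} where the tuned model got worse."""
    base = accuracy_by_task(base_predict, examples)
    tuned = accuracy_by_task(tuned_predict, examples)
    return {
        task: (base[task], tuned[task])
        for task in base
        if tuned[task] < base[task] - tolerance
    }


# Hypothetical eval set: (task, prompt, expected answer).
EXAMPLES = [
    ("sentiment", "great product", "positive"),
    ("sentiment", "terrible service", "negative"),
    ("arithmetic", "2+2", "4"),
    ("arithmetic", "3+5", "8"),
]


def base_model(prompt):
    # Toy stand-in for the base LLM: weak at sentiment, fine at arithmetic.
    answers = {"great product": "positive", "terrible service": "positive",
               "2+2": "4", "3+5": "8"}
    return answers[prompt]


def tuned_model(prompt):
    # Toy stand-in for the fine-tuned LLM: better sentiment, worse arithmetic.
    answers = {"great product": "positive", "terrible service": "negative",
               "2+2": "4", "3+5": "7"}
    return answers[prompt]


regressions = find_regressions(base_model, tuned_model, EXAMPLES)
print(regressions)
```

The check only helps if the eval set covers tasks outside the fine tuning target. An eval set containing only the target task would report an improvement and nothing else.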

Andrej Karpathy has a rant about it that captures some of the failure modes of fine tuning - https://karpathy.github.io/2019/04/25/recipe/

> They will hire anyone who can produce a model better than GPT5, which is the bar for fine tuning

Depends on what you want to achieve, of course, but I see fine-tuning at the current point in time primarily as a cost-saving measure: Transfer GPT5-level skill onto a smaller model, where inference is then faster/cheaper to run. This of course slows down your innovation cycle, which is why imo this is generally not advisable.
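As a rough illustration of the "transfer skill onto a smaller model" idea, the first step is usually collecting the larger model's outputs as training data for the smaller one. The sketch below assumes a hypothetical `teacher(prompt) -> str` callable (here a toy function, not a real API client) and a prompt/completion JSONL layout, which is a common convention but not something specified in this thread.

```python
import json


def build_distillation_set(teacher, prompts):
    """Query the teacher on each prompt and keep non-empty completions."""
    records = []
    for prompt in prompts:
        completion = teacher(prompt)
        if completion:  # skip prompts where the teacher returned nothing
            records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def toy_teacher(prompt):
    # Hypothetical stand-in for a call to the large model.
    return prompt.upper() if prompt.strip() else ""


records = build_distillation_set(toy_teacher, ["hello", "  ", "world"])
jsonl = to_jsonl(records)
print(jsonl)
```

Every change to the task means regenerating this dataset and retraining the small model, which is where the slower innovation cycle comes from.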

I agree this is the main case where it makes sense.

But a recent trend that cuts into the cost savings is that foundation model companies have started releasing small models. So you can build a use case with Qwen 235B, then shrink down to 30B, or even all the way down to 0.6B if you really want to.

The smaller models lose some accuracy, but some use cases are solvable even by these smaller and much more efficient models.

> but it also can make the LLM worse at other tasks

The problem is easily avoided by not using it for other tasks.

Users often found it hard to know exactly where the boundaries are.

This is a reason why general purpose models shine. You don’t have to carefully characterize a task and put guard rails around it.

There is also a reason why you don’t have general purpose applications. Most users understand that Excel is for data tables and Paint is for images even though some people have fun playing with the boundary and creating Excel paintings.
This is exactly the intuition that leads to excitement about fine tuning.

However, I personally think that this intuition applies to products and interfaces, not to AI.

Intelligence and learning are general. Intelligence without generalization is memorization, which seems to be less useful in practice.

Interesting you bring up Excel. ChatGPT's chat interface is going to be Excel for the AI era. Everyone knows there's a better interface to be had, but it just works.
It’s quite easy to produce a model that’s better than GPT-5 at arbitrarily small tasks. As of right now, GPT-5 can’t classify a dog by breed based on good photos for all but the most common breeds, which is like an AI-101 project.
Try doing a head-to-head comparison using all the LLM tricks available, including prompt engineering, RAG, reasoning, inference time compute, multiple agents, tools, etc.

Then try the same thing using fine tuning. See which one wins. In ML class we have labeled datasets with breeds of dogs hand labeled by experts like Andrej. In real life, users don’t have specific, clearly defined, high quality labeled data like that.

I’d be interested to be proven wrong

I think it is easy for strong ML teams to fall into this trap because they themselves can get fine tuning to work well. Trying to scale it to a broader market is where it fell apart for us.

This is not to say that no one can do it. There were users who produced good models. The problem we had was where to consistently find these users who were willing to pay for infrastructure.

I’m glad we tried it, but I personally think it is beating a dead horse/llama to try it today

There are tons of problems this simply doesn’t apply to. In the limited API world this may be true, but agents are far from reliable.
I mean, at the point where you’re writing tools to assist it, we are no longer comparing the performance of 2 LLMs. You’re taking a solution that requires a small amount of expertise, and replacing it with another solution that requires more expertise, and costs more. The question is not “can fine tuning alone do better than every other trick in the book plus a SOTA LLM plus infinite time and money?” The question is: “is fine tuning useful?”
Whether the comparison is fair didn’t seem to matter to users who just wanted to build solutions within a reasonable time and budget.
If your customers can't fine tune, do it for them instead.
How can you hire enough people to scale that while making the economics work?

Why would they join you rather than founding their own company?

Yup! That's why civit.ai doesn't exist right?

They'll pay for anyone that can personalize models to be meaningfully diverse.

I think you misunderstand what they are saying - doing a good job of fine tuning is difficult.

Training an LLM from scratch is trivial - training a good one is difficult. Fine tuning is trivial - doing a good job is difficult. Hitting a golf ball is trivial - hitting a 300 yard drive down the middle of the fairway is difficult.

What models did you try to fine tune? Were the models at the time even good enough to fine tune? Did they suffer from catastrophic forgetting?

We have a lot of more capable open source models now. And my guess is that if you designed models specifically for being fine tuned, they could escape many of the last generation pitfalls.

Companies would love to own their own models instead of renting from a company that seeks to replace them.

We used the best models available and went from the Pythia/GPT-2 generation to the DeepSeek generation.

One annoying part was switching to new and better models that came out literally every week.

I don’t think it substantially changes anything. If anything, I think the release of more advanced models like qwen-next makes things like FP4, MoE, and reasoning tokens an even higher barrier to entry.