by rockwotj 462 days ago
Is anyone outside of the research labs fine-tuning models for production use cases? I have been seeing more people just using foundation models off the shelf, especially in light of the new advancements that seem to come every few months.
8 comments

I've had trouble getting a great answer to this question - I ask it in various places every month or so, most recently here: https://nitter.net/simonw/status/1895301139819860202

On paper fine tuning smaller models can greatly reduce the cost for a specific task, but I've not heard many real-world success stories around that.

I think vision LLMs are one of the most interesting applications here - things like fine-tuning for better results extracting data from a specific paper form or report structure. Again, not many public examples of that.

Oh there's a lot! Some cool examples I see:

1. Codebases, docs, large corpora of internal datasets - fill in the middle, auto completion etc.

2. I know a tonne of financial institutions use fine-tuning for trading, real-time data parsing, headline analysis, signal creation, etc.

3. Distillation is also relatively common - taking outputs of a large model and distilling it to a small model

4. Increasing accuracy is the most important, not cost or latency. We find that if you solve the fine-tuning life cycle, i.e. continuous auto fine-tuning, data filtering, and reinforcement learning via DPO, it works well!

5. Lots of organizations use DPO and preference fine-tuning to align models since they have tonnes of feedback data!

6. Yep, vision fine-tuning! E.g. medical diagnosis, docs, QA on pics, etc.

7. And obviously large model labs fine-tune all base models, i.e. chatgpt4.5 is a fine-tune of a base model

8. Finally, reasoning fine-tuning via GRPO is very cool! If you have inputs and outputs but no labelled chain of thought (CoT) in between, GRPO is the way to go! Custom reward functions by companies!
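
For readers who haven't seen what DPO (items 4 and 5 above) actually optimizes, here is a minimal sketch of the per-example loss in plain Python. The function name, the argument names and the `beta=0.1` default are my own choices for illustration, and the inputs are assumed to be summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the model being trained and under a frozen reference model. Real training code would compute this on batched tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# The loss drops below log(2) once the policy prefers the chosen response
# more strongly than the reference does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The only data this needs is pairs of responses with a preference label, which is why it fits organizations that already collect feedback data.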

"Codebases, docs, large corpora of internal datasets"

I still haven't seen a convincing demo of using fine-tuning to "teach" a model new information from additional documents. I'd love to see one.

(Closest I've come to that is I heard a rumor that Jane Street have fine-tuned an LLM for OCaml)

Here is a small LLM I trained to output dollars and cents from a verbal numeric amount:

https://huggingface.co/TrevorJS/check-amount-deverbalizer-sm...

Vision LLMs are definitely an interesting application.

At Avy.ai we're running small (2B-7B, quantized) vision models as part of a Mac desktop application for understanding what someone is working on in the moment, to offer them related information and actions.

We found that the raw results in understanding the images are not substantially different with a light LoRA fine-tune, but the ease of getting a small model to follow instructions, outputting structured data in response to the image at the level of verbosity and detail we need, is greatly enhanced with fine-tuning. Without fine-tuning, the models on the smaller end of that scale would be much more difficult to use, since they do not reliably produce output that matches what the consuming application expects.

Was constrained decoding not enough to force the output to be in a specific format?
Using a grammar to force the decoder to emit, say, valid JSON would work, but that hasn't always been available in the implementations we've been using (like MLX). That is solvable by software engineering, i.e. adding it to the decoders in those frameworks, but fine-tuning has been effective without that work.

The bigger thing though was getting the models to have the appropriate levels of verbosity and detail in their output, which fine-tuning made more consistent.
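
To make "output that matches what the consuming application expects" concrete, here is a rough sketch of how one might measure format adherence before and after a fine-tune. The field names in `EXPECTED_FIELDS` and the sample outputs are invented for illustration and are not Avy.ai's actual schema.

```python
import json

# Hypothetical schema: field name -> required Python type.
EXPECTED_FIELDS = {"app": str, "activity": str, "confidence": float}

def parse_model_output(raw):
    """Return the parsed dict if it matches EXPECTED_FIELDS exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(EXPECTED_FIELDS):
        return None
    for key, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data[key], expected_type):
            return None
    return data

def format_adherence(outputs):
    """Fraction of raw outputs the consuming application could accept."""
    if not outputs:
        return 0.0
    accepted = sum(parse_model_output(o) is not None for o in outputs)
    return accepted / len(outputs)

samples = [
    '{"app": "Xcode", "activity": "editing Swift", "confidence": 0.9}',
    'Sure! Here is the JSON you asked for: {"app": "Xcode"}',  # chatty preamble
    '{"app": "Xcode"}',                                        # missing fields
]
print(format_adherence(samples))
```

A check like this only covers syntax and field types. Verbosity and level of detail would need separate task-specific checks.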

We use multiple post-trained models in production, at scale at https://osmos.io
Have you published details of how you're doing that anywhere?

Could be a useful marketing strategy for you, given how starved we all are of information about successful fine tuning stories.

Things have been moving so fast that it’s honestly hard for a small team to do that in parallel.

I got to present at GCP Next about a part of this last year: https://www.youtube.com/watch?v=5QsM1K9ahtw

I’m presenting in one (and maybe two) sessions with more info on the training side this year.

Finetuning is easy and worthwhile, especially with LoRAs as these Unsloth demos do. The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable.
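
A quick back-of-the-envelope sketch of why LoRAs make this cheap: instead of updating a full d_out x d_in weight matrix, LoRA trains two low-rank factors of rank r, so the trainable parameter count per matrix is r * (d_out + d_in). The hidden size of 4096 and rank of 16 below are assumed values for illustration, not taken from any specific model or from the Unsloth demos.

```python
def full_params(d_out, d_in):
    """Parameters in one dense weight matrix."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

d = 4096   # assumed hidden size
rank = 16  # assumed LoRA rank

full = full_params(d, d)
lora = lora_trainable_params(d, d, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```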

In practice prompt engineering and few-shot prompting with modern LLMs, due to their strong-and-only-getting-better-over-time prompt adherence, tends to be more pragmatic.

There are inference providers such as Together AI that will serve your LoRA adapters at no extra cost above the model price. Then, there’s basically no difference between using your fine-tuned model or an API model off the shelf (except for the benefits you get from fine-tuning).
This (Serverless LoRA providers) is what most people want even if they don't know it.
Yeah this big time. I haven’t found a solution that makes sense. Larger models are already good enough and so convenient.

When it’s more feasible to do inference on the client (browser or desktop) I can see SLMs popping up more commonly in production.

> The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable

It's not actually that expensive or hard. For narrow use cases, you can produce 4-bit quantized fine-tunes that perform as well as the full model. Hosting the 4-bit quantized version can be done at relatively low cost. You can use an A40 or RTX 3090 on Runpod for ~$300/month.
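
Rough arithmetic behind that claim, as a sketch: weight memory is roughly parameters times bits per weight divided by 8. The 24 GB (RTX 3090) and 48 GB (A40) figures are the cards' VRAM sizes. The 20% overhead for KV cache and activations is my own assumption and varies a lot with context length and batch size.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def fits(n_params, bits_per_weight, gpu_gb, overhead_fraction=0.2):
    """True if weights plus an assumed overhead fit in the given VRAM."""
    needed = weight_memory_gb(n_params, bits_per_weight) * (1 + overhead_fraction)
    return needed <= gpu_gb

RTX_3090_GB = 24
A40_GB = 48

print(weight_memory_gb(7e9, 16))     # 7B model at 16-bit
print(weight_memory_gb(7e9, 4))      # same model at 4-bit
print(fits(70e9, 4, RTX_3090_GB))    # 70B at 4-bit on a 3090
print(fits(70e9, 4, A40_GB))         # 70B at 4-bit on an A40
```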

For self-hosting I've been using https://tuns.sh which is a tunneling solution using SSH. It works great for prototyping and I've been using it to host open-webui
If you have the resources to fine tune, you have the resources to run inference on fine tuned model.

If you want to scale up and down on demand, you can just fine tune on openai and google cloud as well.

> If you have the resources to fine tune, you have the resources to run inference on fine tuned model.

I don't think that's true.

I can fine tune a model by renting a few A100s for a few hours, total cost in the double digit dollars. It's a one-time cost.

Running inference with the resulting model for a production application could cost single digit dollars per hour, which adds up to hundreds or even thousands of dollars a month on an ongoing basis.
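
To put numbers on that, here is a small sketch of the two costs. The $2 per GPU-hour price is a placeholder I picked for illustration, not a quote from any provider, and the `hours_per_day` parameter is there because the monthly figure depends heavily on whether the model has to be up around the clock.

```python
def training_cost(gpus, hours, price_per_gpu_hour):
    """One-time cost of a fine-tuning run."""
    return gpus * hours * price_per_gpu_hour

def monthly_inference_cost(price_per_hour, hours_per_day=24, days=30):
    """Recurring cost of keeping an inference server running."""
    return price_per_hour * hours_per_day * days

PRICE = 2.0  # assumed dollars per GPU-hour

print(training_cost(gpus=4, hours=3, price_per_gpu_hour=PRICE))  # one-time
print(monthly_inference_cost(PRICE))                             # always on
print(monthly_inference_cost(PRICE, hours_per_day=2))            # batch jobs only
```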

This assumes that inference is needed 24/7.

That may or may not be true for use-cases that require asynchronous, bulk inference _and_ require some task-specific post-training.

FWIW, my approach towards tasks like the above is to

1. start with using an off-the-shelf LM API until

2. one figures out (using evals that capture product intent) what the failure modes are (there always are some) and then

3. post-train against those (using the evals)
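
The three steps above can be sketched as a tiny eval loop that records failures, which then become the starting point for post-training data. Everything here is hypothetical scaffolding: `model_fn` stands in for whatever LM API call you use, and each eval case carries its own `check` function encoding product intent.

```python
def run_evals(model_fn, eval_cases):
    """Run every case through the model; return (pass_rate, failures)."""
    failures = []
    for case in eval_cases:
        output = model_fn(case["input"])
        if not case["check"](output):
            failures.append({"input": case["input"], "output": output})
    pass_rate = 1 - len(failures) / len(eval_cases)
    return pass_rate, failures

# Stand-in for an off-the-shelf LM API call (step 1).
def stub_model(text):
    return text.upper()

eval_cases = [
    {"input": "hello", "check": lambda out: out == "HELLO"},
    {"input": "abc", "check": lambda out: out.islower()},  # a case the stub fails
]

# Step 2: find the failure modes. Step 3 would post-train against `failures`.
pass_rate, failures = run_evals(stub_model, eval_cases)
print(pass_rate, failures)
```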

I've been finetuning these models since before chatGPT, and the one lesson I've learned is that by the time you have set up everything to fine-tune a model, you can expect a newer model to do as well with prompt-tuning.

So, unless you hope to stay at the forefront (e.g. to be ahead of competitors), there has been no real reason to fine-tune for the last 4 years. At best you could hope to stay about 1-3 months ahead, depending on how fast you were at setting up your training. And if that is what you did hope to achieve, you needed to automate at a higher level, i.e. automate data collection and the collection of eval cases.

It feels like there should be a service where I just drag drop a folder of examples and it fine tunes the latest DeepSeek or whatever for me and even can host it for me at some cost. I'd pay for that immediately, but last I checked there was nothing that really did that well (would love to be wrong).
There are some options out there, depending on what type of task you're trying to fine tune. I think RL finetuning for DeepSeek e.g. isn't well developed yet, but you can finetune a small LLama model (~3B params) for classification or extraction tasks and it works really well. What sort of tasks were you looking at finetuning for?
Code generation or question answering. But ideally 70+B
Vibe coding has taken over for frontend dev, but outside that narrow band of very visible coding, most models aren't great at more esoteric programming languages. Even Swift gives Claude trouble. So the reason to fine-tune is simply that the best newest models still remain bad at things outside their comfort zone (how human).
I take my quip both ways, so I would wager that even with finetuning, these models are only 1 generation ahead in esoteric language performance and therefore _still not very good_. Am I correct?
Wanting it to be bad reeks of copium.
Why would I want it to be bad? I'm afraid I don't understand what you mean.
You wrote, emphatically, that it would be "still not very good". Why do you believe that it would still not be very good after training on a specific problem? LLMs aren't able to do things outside their training data, as vast as it is, but if it's in its training data, why are you emphatic that it's still not very good? If I ask it to make something that it just needs to copy out of sample code, it would be pretty good at that one very specific task, as far as I'm concerned.
I feel like this is true but would be great if you could provide examples so we could get a better idea of why you think/know this.
I work for DeepMind on project Astra. Without delving too deep into confidential details of the capabilities I have been looking at, it has been the theme since the Flamingo model that you only gain about 1 model-generation by fine-tuning versus prompt-tuning.
I have documents from the last 50 years that I need to digitize, millions of them written in old Arabic. The OCR is not accurate because the documents are handwritten, so I need to fine-tune a model on around 300k pairs of texts (OCR output and manually corrected versions).
This sounds very interesting; can you share more? Thanks!
I followed this guide for fine-tuning: https://ai.google.dev/gemini-api/docs/model-tuning

Arabic OCR is a mess with historical texts. Take the word الف (alf/thousand) in dates like 1950 - in old documents, the ف (fa) had a dot below it, but modern OCR doesn't get this and outputs الد (alad), which is just gibberish in Arabic

Same problem with ق (qaf) written as ف (fa) in old Arabic

And don't get me started on merged letters! In محمد (Muhammad), sometimes the م (meem) sits right on top of the ح (haa), or appears as a little circle below the line. Modern OCR has no clue what to do with these

My solution? Run OCR first, then use LLMs to fix the mess based on context. The surprising part? In my tinkering, smaller fine-tuned models actually do BETTER at this specific task than the big general-purpose ones. They seem to learn the patterns of historical Arabic quirks more effectively. Pretty neat tradeoff of specialized knowledge vs. general intelligence
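
A minimal sketch of turning (OCR output, corrected text) pairs into a JSONL fine-tuning file with a held-out validation split. The `input`/`output` field names and the instruction string are assumptions for illustration, so check the tuning guide of whatever provider you use for the exact format it expects. The two sample pairs reuse words mentioned above, and correct pairs are kept on purpose so the model also learns when to leave text alone.

```python
import json
import random

# Hypothetical instruction prepended to every example.
INSTRUCTION = "Correct the OCR errors in this historical Arabic text."

def build_records(pairs):
    """pairs: iterable of (ocr_output, manually_corrected) strings."""
    return [{"input": f"{INSTRUCTION}\n\n{ocr}", "output": fixed}
            for ocr, fixed in pairs]

def train_val_split(records, val_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out a validation set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def to_jsonl(records):
    # ensure_ascii=False keeps the Arabic readable instead of \u escapes.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

pairs = [
    ("الد", "الف"),    # OCR misread described above
    ("محمد", "محمد"),  # already correct, kept unchanged
]
records = build_records(pairs)
train, val = train_val_split(records)
print(to_jsonl(records))
```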

IMHO the biggest factor holding that back is how rushed and distanced these model releases are, still.

Both Phi-4-mini and Gemma 3 were released recently. Phi-4's damn close to a good, real, model release. Microsoft's done a great job of iterating.

Gemma 3's an excellent, intelligent model, but it's got a gaping blind spot: tool-calling / JSON output. There was a vague, quick handwave about it in some PR, and a PM/eng on the Gemma team commented here, in response to someone else, that (TL;DR) "it's supported in Ollama!", which is Not Even Wrong, i.e. in the Pauli sense of the phrase.

- Ollama uses a weak, out of date llama.cpp thing where the output tokens are constrained to match a JSON schema. This falls apart almost immediately, i.e. as soon as there is more than one tool.

- The thing that matters isn't whether we can constrain output tokens, any model can do that, I've had Llama 3 1B making tool calls that way. The thing that matters is A) did you train that in and B) if you did, tell us the format

All that to say, IMHO we're still 6 months to a year out from BigCo understanding enough about their own stuff to even have a good base for it. Sure, tool calling and fine-tuning are orthogonal, in a sense, but in practice, if I'm interested in getting a specific type of output, odds are I wanted that formatted a specific way.

Gemma3 1B seems to be able to choose which tool to use for very simple cases, if you constrain using anyOf, and narrow it down to just a few with RAG first.

It can't understand numbers very well though, "one thousand five" might become "1500".

JSON constraints seem to make them unable to figure it out even if they'd normally get it every time.

Maybe it's different with models above 4B though.
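
For anyone wondering what "constrain using anyOf" looks like, here is a sketch of a JSON Schema with one branch per tool that a constrained decoder could be pointed at. The tool names and argument schemas are made up, and whether a given runtime supports `anyOf` or `enum` in its constrained decoding is something to verify for your stack. `matching_branch` is only a toy lookup by tool name, not a full schema validator.

```python
# Hypothetical tools: name -> JSON Schema for the arguments.
TOOLS = {
    "get_weather": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    "set_timer": {
        "type": "object",
        "properties": {"seconds": {"type": "integer"}},
        "required": ["seconds"],
    },
}

def tool_call_schema(tools):
    """Build a schema where a valid output is exactly one tool call."""
    return {
        "anyOf": [
            {
                "type": "object",
                "properties": {
                    "tool": {"enum": [name]},
                    "arguments": args_schema,
                },
                "required": ["tool", "arguments"],
                "additionalProperties": False,
            }
            for name, args_schema in tools.items()
        ]
    }

def matching_branch(schema, call):
    """Toy lookup: return the anyOf branch whose tool name matches, else None."""
    for branch in schema["anyOf"]:
        if call.get("tool") in branch["properties"]["tool"]["enum"]:
            return branch
    return None

schema = tool_call_schema(TOOLS)
print(len(schema["anyOf"]))
```

Narrowing `TOOLS` down with retrieval first, as described above, just means passing a smaller dict to `tool_call_schema`.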

Could one now train a Gemma 3 fine-tune for tool use?

found this on HF https://huggingface.co/ZySec-AI/gemma-3-27b-tools

We were but with the models becoming so good, so large, and so cheap, we've largely abandoned it in our long-term roadmap.
I’m trying right now. The combination of small models, qlora and grpo has made it accessible to experimenters. I’m not using unsloth yet, but I will probably start checking it out pretty soon so that I can train larger models or increase the number of generations for grpo.
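
For context on why the number of generations matters for GRPO: each prompt gets a group of sampled completions, each is scored by a reward function, and the advantage of a completion is its reward normalized against the rest of its group. A minimal sketch follows, assuming a toy exact-match reward on the last line of the completion (my own choice for illustration) and population standard deviation for the normalization. Implementations differ on details like that.

```python
import statistics

def exact_match_reward(completion, expected):
    """Toy reward: 1.0 if the last non-empty line equals the expected answer."""
    lines = completion.strip().splitlines()
    answer = lines[-1].strip() if lines else ""
    return 1.0 if answer == expected else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four sampled generations (a group of size 4).
completions = ["Let me think...\n42", "I guess\n41", "", "Working...\n42"]
rewards = [exact_match_reward(c, "42") for c in completions]
print(rewards)
print(group_relative_advantages(rewards))
```

If every generation in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason larger groups can help.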
I am. I have some use cases related to data extraction where using a fine tuned small model outperforms the best-in-class closed source models and at a fraction of the cost.