| Yeah, that's my assumption too. Fine-tuning is really expensive in terms of skills and time needed to attempt it, and there's a very real chance that your attempts will fail to make a meaningful improvement over being smarter with your prompts. Even worse, even if you DO get an improvement you are likely to find that it was a waste of time in a month or two when then next upgraded version of the underlying models are released. The places it makes sense from what I can tell are mainly when you are running so many prompts that the cost saving by running a smaller, cheaper model can outweigh the labor and infrastructure costs involved in getting it to work. If your token spend isn't tens (probably hundreds) of thousands of dollars you're unlikely to save money like this. If it's not about cost saving, the other reasons are latency and being able to achieve something that the root model just couldn't do. Datadog reported a latency improvement, because fine-tuning let them run a much smaller (and hence faster) model. That's a credible reason if you are building high value features that a human being is waiting on, like live-typing features. The most likely cases I've heard of for getting the model to do something it just couldn't do before mainly involve vision LLMs, which makes sense to me - training a model to be able to classify images that weren't in the training set might make more sense than stuffing more example images into the prompt (though models like Gemini will accept dozens of not hundreds of comparable images in the prompt, which can then benefit from prompt caching). The last category is actually teaching it a new skill. The best example here are low-resource programming languages - Jane Street and OCaml or Morgan Stanley and Q for example. Jane Street OCaml: https://www.youtube.com/watch?v=0ML7ZLMdcl4 Morgan Stanley Q: https://huggingface.co/morganstanley/qqWen-1.5B-SFT |
Depending on use case you could either insert the LoRA weights as their own layers at runtime (no time to create, but extra layer to compute each token), merge them with existing layers (initial delay to merge layers, but no runtime penalty after), or have pre-merged models for common cases (no perf penalty but have to reserve more storage).