| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by anndvision 323 days ago

We recently ran similar experiments and saw that fine-tuning small models on automatically curated high-quality outputs from a large model can beat large-model performance while reducing inference costs by up to 30x and inference time by up to 4x.

We benchmarked closed-source (OpenAI, Google) and open-source (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG (Multi-Hop), and agentic tool use (τ-bench).

We're still running a few experiments and plan to update the post with additional results in a few days.

Looking forward to trying out importance weighting soon!

Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x Lower Cost: https://www.tensorzero.com/blog/curated-behavior-cloning-sma...

2 comments

chongliqin 323 days ago

Cool! If you are interested, we have open sourced our code: https://github.com/emmyqin/iw_sft

link

anndvision 323 days ago

thanks

link

TheTaytay 323 days ago

Thanks for this - I’ve spent the last hour reading your docs and blog. I like the primitives you’ve exposed in your APO, and particularly like the decision to separate out the structured inputs from the prompt when you record an LLM call, so I can finally perform optimizations and evals on past calls.

Quick question : you mentioned unsloth in the blog post. Which of the fine tuning providers mentioned is using unsloth under the hood?

link

GabrielBianconi 323 days ago

[I'm his coworker.] We ran Unsloth ourselves on a GPU-by-the-hour server. We have a notebook in the repository showing how to query historical data and use it with Unsloth.

It's a WIP PR that we plan to merge soon: https://github.com/tensorzero/tensorzero/pull/2273

link