Hacker News
by mluo 499 days ago
We beat o1-preview, and even many 7B models, on many math benchmarks, all of which were TEST sets (not in the training set at all).

If you want to make the model fully generalist, feel free to train it on coding datasets (for example, RL with passing unit tests as the reward).
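To make the "passing unit tests as reward" idea concrete, here is a minimal sketch of a binary reward function. It is an illustration only, not the DeepScaleR training code: the name `unit_test_reward`, the assertion-style tests, and the 10-second timeout are all assumptions, and a real RL setup would run untrusted model output in a proper sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def unit_test_reward(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if the candidate passes every assertion in test_code, else 0.0.

    Hypothetical helper: the tests are plain `assert` statements appended to
    the candidate and executed in a separate Python process.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "run_candidate.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Hanging or too-slow solutions earn no reward.
            return 0.0
    # A failed assertion or any other exception gives a non-zero exit code.
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    print(unit_test_reward("def add(a, b):\n    return a + b", tests))
    print(unit_test_reward("def add(a, b):\n    return a - b", tests))
```

The reward is deliberately all-or-nothing; partial credit per passing test is another possible design, but nothing in the thread says which one is used.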

3 comments

It's already a good accomplishment as it is, but I think it'd be very surprising if training such a small model as a generalist scaled to the same magnitude as specialized finetuning. At some point you have to fit more background data and relations into the same amount of information space... but it's hard to say how much of that is a limit of the given size vs. what we just haven't optimized yet. Unfortunately I think that will have to wait for someone with more compute before we can verify this * a dozen one way or the other :).

Side question, since it sounds like you were involved: how big is the impact on benchmarks of taking this 1.5B model down from fp32 to fp8 or similar? The focus on parameters alone sometimes feels like comparing house sizes by their lengths alone. And, if you were indeed involved, thanks for making all of this open and available!

Quantization has a very big impact on small models; it can drop as much as 10% on AIME. Our model does best in bfloat16 ;)

Come check out our repo at: https://github.com/agentica-project/deepscaler
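For intuition on why low-precision formats can hurt, here is a rough sketch of the rounding error from a quantize/dequantize round trip. Big assumptions: it uses symmetric per-tensor int8 rather than fp8 (a different format), the weights are made-up numbers, and it says nothing about the AIME figures above. It only shows that small-magnitude values lose the most relative precision.

```python
def quantize_roundtrip(weights, bits=8):
    """Symmetric per-tensor integer quantization followed by dequantization.

    Illustrative stand-in for low-precision storage, not an fp8 implementation.
    """
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    # Each weight snaps to the nearest multiple of `scale`.
    restored = [round(w / scale) * scale for w in weights]
    return restored, scale


if __name__ == "__main__":
    weights = [0.8, -0.31, 0.0042, 0.05, -0.0007]   # hypothetical values
    restored, scale = quantize_roundtrip(weights)
    for w, r in zip(weights, restored):
        print(f"{w:+.4f} -> {r:+.4f} (abs error {abs(w - r):.5f})")
```

The absolute error per weight is bounded by half the step size, so a weight much smaller than the step (like the last one here) rounds all the way to zero.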

It is a great discovery; it could even open up a next step in AI with MoM, "Mixture of Models", where small fine-tuned models each take one part of a task (instead of the current MoE).

Check out some of my prior work: https://stylus-diffusion.github.io/

This work scales up selection/routing over many models/LoRAs
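As a toy picture of what routing over several specialized models could look like, here is a sketch that picks a specialist by keyword overlap with the prompt. This is not how Stylus works; the specialist names, the keyword sets, and the `generalist` fallback are all hypothetical, and a real router would more likely use embeddings or a learned classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specialist:
    name: str               # hypothetical model identifier
    keywords: frozenset     # words that suggest this model's domain


def route(prompt: str, specialists, fallback: str = "generalist") -> str:
    """Pick the specialist whose keywords overlap the prompt the most."""
    words = set(prompt.lower().split())
    best = max(specialists, key=lambda s: len(words & s.keywords))
    # With no overlap at all, hand the prompt to the fallback model.
    if not words & best.keywords:
        return fallback
    return best.name


SPECIALISTS = [
    Specialist("math-1.5b", frozenset({"integral", "prove", "equation", "solve"})),
    Specialist("code-1.5b", frozenset({"python", "function", "bug", "compile"})),
]

if __name__ == "__main__":
    for prompt in ("solve this equation", "fix the bug in this python function", "tell me a story"):
        print(prompt, "->", route(prompt, SPECIALISTS))
```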

Love it, will check, thank you for showing / sharing all of that!

o1 is more than just a math solver. And you cannot possibly train that much into a small model.

However, smaller specialized models look to be the right way to handle the world's complexity: a sort of mixture of experts one level up. Orchestrating them will be another problem. A possible solution is a generalist model "to rule them all".

Have you considered the very practical importance of running specialized models for specialized tasks on common hardware (maybe a couple of CPU cores in a couple GB of RAM)?
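A quick back-of-the-envelope check on the "couple GB of RAM" point, using the 1.5B parameter count mentioned earlier in the thread. It counts weights only (no KV cache, activations, or runtime overhead), and the bytes-per-parameter table is the standard size of each format rather than anything specific to this model.

```python
# Storage size of one parameter in each numeric format, in bytes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}


def weight_footprint_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"1.5B params @ {dtype}: {weight_footprint_gb(1.5e9, dtype):.1f} GB")
    # 1.5B params @ fp32: 6.0 GB
    # 1.5B params @ bf16: 3.0 GB
    # 1.5B params @ fp8: 1.5 GB
```

So at 1.5B parameters, whether the weights fit in a couple of GB depends mostly on the precision they are stored in.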
Small models are just tools. Even many of them will only make a toolset. They don't evolve into AGI by themselves. But putting them together in a structure (a brain) may result in something close, like a big smart calculator. It takes more to create a 'character' similar to, say, the Terminator.