by gertlabs 32 days ago
This is what we do at gertlabs.com - the foundation labs are actually starving for better data. Having quality data is not the same as having a lot of data. Human-curated data and RLHF cannot scale to a 5T model, and synthetic data pipelines are still very much a work in progress across the industry.

Some interesting notes:

- Training a small model on a large model's output resulted in LESS improvement than distilling a less capable model onto the same small architecture [0]. We are starting to hit intelligence density limits in small models (<30B models may be nearing saturation now).

- Good RL environments incidentally also make for good benchmarks.

[0] https://arxiv.org/html/2502.12143v1


Wouldn’t it be good to start investigating a micro-model architecture? For example, a first model checks the context and routes to the Java-optimized model, etc. That would also make it simpler to load and unload models in memory.

So, extremely small models that are each only good at a certain task, such as a single programming language. A small model at the front that is extremely good at classifying tasks, and then a more complex model that can bring these micro models back together.
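To make the idea concrete, here is a rough Python sketch of the front-end routing and load/unload part. Everything in it is hypothetical: the `ROUTES` keyword table stands in for a real classifier model, `ModelPool` and `route` are made-up names, and `_load` returns a placeholder function instead of reading actual weights. The final "bring it back together" model is left out.

```python
from collections import OrderedDict

# Hypothetical keyword table standing in for the small front classifier.
ROUTES = {
    "java": ("public class", "import java.", "System.out"),
    "python": ("def ", "import numpy", "print("),
}


def classify(prompt: str) -> str:
    """Pick the specialist whose keywords occur most often, else 'general'."""
    scores = {
        name: sum(prompt.count(keyword) for keyword in keywords)
        for name, keywords in ROUTES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


class ModelPool:
    """Keeps at most `capacity` specialists loaded; evicts the least recently used."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.loaded = OrderedDict()  # model name -> callable model
        self.evicted = []            # names unloaded so far, oldest first

    def get(self, name: str):
        if name in self.loaded:
            # Mark as most recently used.
            self.loaded.move_to_end(name)
        else:
            if len(self.loaded) >= self.capacity:
                # Unload the least recently used specialist to free memory.
                old_name, _ = self.loaded.popitem(last=False)
                self.evicted.append(old_name)
            self.loaded[name] = self._load(name)
        return self.loaded[name]

    def _load(self, name: str):
        # Placeholder (assumption): a real system would load weights here.
        return lambda prompt: f"[{name}] {prompt}"


def route(pool: ModelPool, prompt: str) -> str:
    """Classify the prompt, fetch the matching specialist, and run it."""
    return pool.get(classify(prompt))(prompt)
```

With `capacity=1`, routing a Java prompt and then a Python prompt unloads the Java specialist before the Python one is loaded, which is the memory benefit I had in mind.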

My guess is that we underestimate how much non-Java data, and context in general, is needed to create a good Java coding model. It could be true that a good Java model would be 80-90% of the size of a comparable overall coding model.

Obviously, I have no idea, but I guess it’s not as simple as “just train only on Java code and reduce size to 1/10th”.

I think you're describing Mixture-of-Experts.