by ColonelPhantom 85 days ago
If generating synthetic data is such a great way to improve performance, why would it not be applied to the slowrun? Especially for the unlimited compute track, you should have plenty of time to generate as much synthetic data as your heart desires.

Intuitively, I would expect the synthetic data to mostly just "regurgitate" the existing data and not add much. I could be wrong, of course, and perhaps doing reinforcement learning somewhere could solve that issue as well (though I don't know if there is much hidden in FineWeb that you could do RL on; at best you could probably do self-verification?)

There's some evidence that carefully chosen synthetics might convey useful priors, improving convergence speed, generalization, and final performance.

Just the other day this was posted, for example: https://news.ycombinator.com/item?id=47388293

Interesting; I was not aware of those "universal synthetics" but they make sense: a stronger reasoning base would make modeling tasks easier. Thanks for the link!

Again, though, if those work, I assume they will be used for the slowrun. Surely a few hundred LoC to generate data would not be considered cheating :)