by jakobov 764 days ago

Gotcha. That makes sense. Thanks!

What are the theories as to why this works better than training on a larger quantity of non-simulated tokens?

Is it because the gradient from the non-simulated tokens is too noisy for a small model to learn from effectively?