by jakobov
714 days ago
Gotcha, that makes sense. Thanks! What are the theories as to why this works better than training on a larger quantity of non-simulated tokens? Is it because the gradient signal from non-simulated tokens is too noisy for a small model to learn from effectively?