| HN Mirror

What I mean is, it’s important to train a model with and without books3. That’s the only way to know whether it was books3 itself causing the issue, or some artifact of the training process.

One thing that’s hard to measure is the knowledge contained in books3. If someone asks about certain books, it won’t be able to give an answer unless the knowledge is there in some form. I’ve often wondered whether scraping the internet is enough rather than training on books directly.

But be careful about relying too much on evals. Ultimately the only benchmark that matters is whether users find the model useful. The clearest test of this would be to train two models side by side, with and without books3, and then ask some people which they prefer.

It’s really tricky to get all of this right. But if there’s more details on the pes2o ablations I’d be curious to see.