by echelon
515 days ago
This has me curious about ARC-AGI. Would it have been possible for OpenAI to have gamed ARC-AGI by seeing the first few examples and then quickly mechanical turking a training set, fine tuning their model, then proceeding with the rest of the evaluation? Are there other tricks they could have pulled? It feels like unless a model is being deployed to an impartial evaluator's completely air-gapped machine, there's a ton of room for shenanigans, dishonesty, and outright cheating.
In the o3 announcement video, the president of ARC Prize said they'd be partnering with OpenAI to develop the next benchmark.
> mechanical turking a training set, fine tuning their model
You don't need mechanical turking here. You can use an LLM to generate a lot more data that's similar to the official training data, and then train on that. It sounds like "pulling yourself up by your bootstraps", but it isn't. An approach that does this has been published, and it seems to scale very well with the amount of generated training data (the authors won the first-place paper award).
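To make the idea concrete, here is a rough sketch of the "generate more tasks like the official ones, keep the well-formed ones, train on those" loop. This is not the published method, just an illustration under assumptions: tasks use the public ARC JSON layout (`train`/`test` lists of `input`/`output` grids with cells 0-9), and `propose` is a hypothetical stand-in for whatever LLM call you'd actually make. The `mirror_proposer` below is a toy placeholder, not an LLM.

```python
import json
from typing import Callable, Dict, List

Grid = List[List[int]]
Task = Dict[str, List[Dict[str, Grid]]]


def is_valid_grid(grid) -> bool:
    """True for a non-empty rectangular grid whose cells are ints in 0..9."""
    if not isinstance(grid, list) or not grid:
        return False
    if not all(isinstance(row, list) and row for row in grid):
        return False
    width = len(grid[0])
    return all(
        len(row) == width
        and all(isinstance(c, int) and not isinstance(c, bool) and 0 <= c <= 9
                for c in row)
        for row in grid
    )


def is_valid_task(task) -> bool:
    """True if the task has non-empty train/test lists of valid grid pairs."""
    if not isinstance(task, dict):
        return False
    for split in ("train", "test"):
        pairs = task.get(split)
        if not isinstance(pairs, list) or not pairs:
            return False
        for pair in pairs:
            if not isinstance(pair, dict):
                return False
            if not (is_valid_grid(pair.get("input"))
                    and is_valid_grid(pair.get("output"))):
                return False
    return True


def expand_training_set(seeds: List[Task],
                        propose: Callable[[Task], object],
                        per_seed: int = 3) -> List[Task]:
    """Ask `propose` for new tasks per seed; keep valid, non-duplicate ones."""
    seen = {json.dumps(t, sort_keys=True) for t in seeds}
    generated: List[Task] = []
    for seed_task in seeds:
        for _ in range(per_seed):
            candidate = propose(seed_task)  # hypothetical: an LLM call in practice
            if not is_valid_task(candidate):
                continue  # generated output is often malformed; drop it
            # Keep only the fields we validated, so stray keys can't leak in.
            task = {
                split: [{"input": p["input"], "output": p["output"]}
                        for p in candidate[split]]
                for split in ("train", "test")
            }
            key = json.dumps(task, sort_keys=True)
            if key not in seen:
                seen.add(key)
                generated.append(task)
    return generated


def mirror_proposer(task: Task) -> Task:
    """Toy placeholder for an LLM: flips every grid left-to-right."""
    return {
        split: [{"input": [row[::-1] for row in p["input"]],
                 "output": [row[::-1] for row in p["output"]]}
                for p in task[split]]
        for split in ("train", "test")
    }


seed = {
    "train": [{"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]}],
    "test": [{"input": [[2, 0]], "output": [[0, 2]]}],
}
new_tasks = expand_training_set([seed], mirror_proposer, per_seed=3)
print(len(new_tasks), "new task(s) to add to the fine-tuning set")
```

The validation and de-duplication steps are the part that matters: a generator will happily emit ragged grids or near-copies of the seeds, and whether the surviving tasks are actually useful for training is a separate question this sketch doesn't answer.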