> We evaluate model performance and find that frontier models are still unable to solve the majority of tasks.