| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by gertlabs 21 days ago
	Success rate includes syntax/compilation failures as well as environment rule violations, and is almost entirely from one-shot code generations. Percentile shows how well the working submissions perform. In long horizon agentic coding evaluations, strong models fix the syntax and percentile and it becomes a direct comparison of which submissions per language performed the best on average. You can filter for that here: https://gertlabs.com/rankings?provider=openai&mode=agentic_c...