by stared
106 days ago
I reran it for GPT-5.2-Codex at high and xhigh effort. Finally, it matches my experience, and it is actually good (as good as the best models at localization, with a still-impressive 0% false positive rate):
https://quesma.com/benchmarks/binaryaudit/ I will rerun it on GPT-5.3-Codex shortly, now that the API is out (though the effort setting does not work correctly yet, and for "medium" it is very low).