by energy123
60 days ago
> 93.6% (congrats Anthropic)

But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submission"

0.191 * 0.594 > 1 - 0.936

Does this mean that the audited subset wasn't representative? Or that Anthropic is getting such a high score through some shady means?
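For anyone who wants to check the arithmetic, here is a quick sketch. The variable names are mine, and it assumes a problem with flawed tests can never be passed, which the article's "reject functionally correct submission" wording suggests but does not strictly guarantee for every submission.

```python
# Figures taken from the comment above; the names are illustrative only.
audited_share = 0.191    # audited problems as a share of all problems
flawed_rate = 0.594      # lower bound on flawed problems within the audited subset
reported_score = 0.936   # the reported pass rate

# Share of the whole benchmark with flawed tests, as a lower bound.
flawed_floor = audited_share * flawed_rate

# Share of the benchmark the reported score leaves unsolved.
headroom = 1 - reported_score

# Assumption: flawed problems are unpassable, so this caps the score.
implied_ceiling = 1 - flawed_floor

print(f"flawed share >= {flawed_floor:.3f}")
print(f"unsolved share = {headroom:.3f}")
print(f"implied ceiling = {implied_ceiling:.3f}")
```

Under that assumption the flawed share (about 11.3%) is larger than the unsolved share (6.4%), which is the tension the inequality points at. If some flawed problems can still be passed by certain submissions, the ceiling loosens.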