by musebox35 51 days ago
They say they did test, but the coverage was not enough to catch the regression, at least for the prompt change:

“After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16.

As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.”
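
For anyone unfamiliar with the term, the line-level ablation described in the quote could look roughly like the sketch below. This is only an illustration, not the vendor's actual harness: `ablate_lines`, `toy_eval`, and the example prompt lines are all hypothetical names and values I made up, and the toy scorer just hard-codes a 3-point penalty to mimic the kind of drop mentioned.

```python
from typing import Callable, List, Tuple


def ablate_lines(
    prompt_lines: List[str],
    evaluate: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Leave-one-line-out ablation: score the prompt with each line removed.

    Returns (removed_line, score_delta) pairs, where score_delta is the
    ablated score minus the baseline score. A positive delta means the
    score went up once that line was removed, i.e. the line was hurting.
    """
    baseline = evaluate("\n".join(prompt_lines))
    results = []
    for i, line in enumerate(prompt_lines):
        ablated = "\n".join(prompt_lines[:i] + prompt_lines[i + 1:])
        results.append((line, evaluate(ablated) - baseline))
    # Largest improvement first: those lines are the prime suspects.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for a real eval suite: pretend any prompt
# containing the word "brief" scores 3 points lower out of 100.
def toy_eval(prompt: str) -> float:
    return 97.0 if "brief" in prompt else 100.0


lines = ["You are a coding assistant.", "Keep answers brief.", "Cite file paths."]
ranking = ablate_lines(lines, toy_eval)
print(ranking[0])  # the line whose removal helped the most
```

The catch, which matches the quote, is that this only finds what `evaluate` can measure: if the eval set is too narrow, every delta comes back as zero and the change looks safe.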

Considering the number and scope of users they serve, I can sympathize with the difficulty. However, they should reimburse affected users, at least partially, instead of just announcing “our bad, sorry”. That would reduce the frustration.

1 comment

Naively, one could assume that with AI it should be possible to create a long and broad list of test cases…