by shevy-java
124 days ago
So the best one found about 50%. I think that is not bad,
probably better than most humans. But what about the remaining
50%? Why were some found and others not?

> Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about

> Even the best model in our benchmark got fooled by this task.

That is quite strange, because it seems almost as if a human is
required to make the AI tools understand this.