by nopinsight
126 days ago
From Claude 4.6 Thinking: OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed the eval scripts reliably score success/failure; the full set has some noisy grading where correct actions can still get marked wrong. Scores on Verified tend to run higher, so they're not directly comparable.