by causal
89 days ago
Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not. I'm guessing you did not hand the human testers JSON blobs to work with, and I suspect they would also score 0% without the eyesight and visual cortex harness supporting their reasoning ability.

(This version of the benchmark would be several orders of magnitude harder relative to current capabilities...)