by silvr
394 days ago
While true to a degree, I think this is largely wrong. Wouldn't it still count as a "harness" if we gave these LLMs full robotic control of two humanoid arms, so that they could hold a Game Boy and play the game that way? I don't think the lack of that level of human-ness takes away from the demonstration of long-context reasoning that the GPP stream showed. Claude got stuck reasoning its way through one of the more complex puzzle areas. Gemini also took a while on it, but made it through. I don't think that difference can be fully attributed to the harnesses. Obviously, the best thing to do would be to run a side-by-side (SxS) comparison of the two models in the same harness. Maybe that will happen?
Basically, Gemini's completion of the game belongs to an inferior category of experiment (however minuscule the difference).
I get it though. People demanded these types of changes in the CPP Twitch chat, because the pain of watching the model fail in slow motion is simply too much.