by thorio
129 days ago
I challenged Gemini with this too, but it also gave the correct answer. What came to my mind was: couldn't every LLM vendor easily fund a team that does nothing but track these interesting edge cases and quickly deploy filters for them, selectively routing such questions to more expensive models? Isn't that probably how they game benchmarks too?

Like, this is not an architectural problem, unlike the strawberry nonsense; it's just some dumb kind of overfitting to a standard "walking is better" answer.