by column
215 days ago
"[a photoshopped picture of a dog with 5 legs]...please count the legs" Meanwhile, you could benchmark for something actually useful. If you're about to say "But that means it won't work for my use case of identifying a person on a live feed" or whatever, then why don't you test that? I really don't understand the kick people get out of successfully tricking LLMs on nonproductive tasks with no real-world application. Just like the "how many r in strawberry?", "uh uh uh it says two urh urh"... OK, but so what? What good is a benchmark that is so far from a real use case?

It's a perfectly valid benchmark and very telling.