by nabakin
911 days ago
It's much more accurate than the Open LLM Leaderboard, that's for sure. Human evaluation has always been the gold standard. I just wish we could filter by the votes that were made after only one or two prompts, and I hope they don't include the non-blind votes in the results.

-Ask any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and vote for the better one!
-You can continue chatting until you identify a winner.
-Vote won’t be counted if model identity is revealed during conversation.