by kristopolous
143 days ago
I made a humor eval: https://github.com/kristopolous/humor-evals

Here are results for 34 models (I'm testing a few more right now). So far gemini-3-flash-preview is in the lead: https://docs.google.com/spreadsheets/d/1wLqHA0ohxukgPLpSgklz...

A score of 50 is coin-toss odds. The dataset is 195,000 Reddit jokes with their scores. The model is shown pairs of jokes (one highly upvoted, one poorly rated) and has to pick the funnier one.

Example prompt:

Which joke from reddit is funnier? Reply only "A" or "B". Do not be conversational.
<Joke A><setup>Son: "Dad, Am I adopted"?</setup>
<punchline>Dad: "Not yet. We still haven't found anyone who wants you."</punchline></Joke A>
<Joke B><setup>Knock Knock</setup>
<punchline>Who's there?
Me.
Me who?
I didn't know you had a cat.</punchline></Joke B>

This is my first crack at evals. I'm open to improvements.
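
For anyone wondering how a pairwise setup like this can be scored, here is a minimal Python sketch. It is not the code from the repo. `ask_model` is a hypothetical callable standing in for whatever API call returns the model's reply, and randomizing which slot the upvoted joke lands in is an assumption, added so that a model that always answers "A" ends up near 50.

```python
import random

PROMPT_HEADER = (
    'Which joke from reddit is funnier? Reply only "A" or "B". '
    "Do not be conversational."
)


def format_joke(label, setup, punchline):
    """Wrap one joke in the tag layout used by the example prompt."""
    return (
        f"<Joke {label}><setup>{setup}</setup>\n"
        f"<punchline>{punchline}</punchline></Joke {label}>"
    )


def build_prompt(high, low, rng):
    """Return (prompt, correct_label) for one (setup, punchline) pair of jokes.

    `high` is the highly upvoted joke, `low` the poorly rated one.
    The slot of the upvoted joke is randomized (an assumption of this sketch).
    """
    if rng.random() < 0.5:
        a, b, correct = high, low, "A"
    else:
        a, b, correct = low, high, "B"
    prompt = "\n".join(
        [PROMPT_HEADER, format_joke("A", *a), format_joke("B", *b)]
    )
    return prompt, correct


def accuracy(pairs, ask_model, seed=0):
    """Percentage of pairs where the model picks the upvoted joke."""
    if not pairs:
        raise ValueError("need at least one joke pair")
    rng = random.Random(seed)
    hits = 0
    for high, low in pairs:
        prompt, correct = build_prompt(high, low, rng)
        # Take the first character of the reply so "A." or "a" still counts.
        reply = ask_model(prompt).strip().upper()
        hits += reply[:1] == correct
    return 100.0 * hits / len(pairs)
```

A model that ignores the jokes and always replies "A" should land close to 50 on a large enough set of pairs, which is the coin-toss baseline.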