Hacker News
by kristopolous 143 days ago
I made a humor eval: https://github.com/kristopolous/humor-evals

Here are results for 34 models (testing a few more right now). So far gemini-3-flash-preview is in the lead.

https://docs.google.com/spreadsheets/d/1wLqHA0ohxukgPLpSgklz...

A score of 50 is coin-toss odds. The dataset is 195,000 Reddit jokes with scores. The model is shown pairs of jokes (one highly upvoted, one poorly rated) and has to pick the funnier one.
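
For anyone who wants to picture the scoring, here's a rough sketch of how a pairwise setup like this could work. It is not taken from the repo: `make_pairs`, `accuracy_percent`, the thresholds, and the `(text, score)` layout are all made up for illustration.

```python
import random

def make_pairs(jokes, hi_threshold, lo_threshold, rng):
    """Pair each highly upvoted joke with a random poorly rated one.

    `jokes` is a list of (text, score) tuples. The thresholds are
    hypothetical; the real eval may select pairs differently.
    Returns (joke_a, joke_b, answer) tuples, with the better-rated joke
    placed in slot A or B at random so position carries no signal.
    """
    high = [text for text, score in jokes if score >= hi_threshold]
    low = [text for text, score in jokes if score <= lo_threshold]
    pairs = []
    for good in high:
        bad = rng.choice(low)
        if rng.random() < 0.5:
            pairs.append((good, bad, "A"))
        else:
            pairs.append((bad, good, "B"))
    return pairs

def accuracy_percent(answers, predictions):
    """Percent of pairs where the model picked the higher-rated joke."""
    correct = sum(a == p.strip().upper() for a, p in zip(answers, predictions))
    return 100.0 * correct / len(answers)
```

A model that always answers "A" should land near 50 with this kind of shuffling, which is where the coin-toss baseline comes from.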

Example prompt:

Which joke from reddit is funnier? Reply only "A" or "B". Do not be conversational. <Joke A><setup>Son: "Dad, Am I adopted"?</setup> <punchline>Dad: "Not yet. We still haven't found anyone who wants you."</punchline></Joke A> <Joke B><setup>Knock Knock</setup> <punchline>Who's there? Me. Me who? I didn't know you had a cat.</punchline></Joke B>
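
A minimal sketch of how a prompt in that shape could be assembled and the reply checked. The helper names `build_prompt` and `parse_reply` are hypothetical and not from the repo; only the prompt wording and tags come from the example above.

```python
def build_prompt(joke_a, joke_b):
    """Build the A/B prompt from two (setup, punchline) tuples."""
    def render(label, joke):
        setup, punchline = joke
        return (f"<Joke {label}><setup>{setup}</setup> "
                f"<punchline>{punchline}</punchline></Joke {label}>")
    return ('Which joke from reddit is funnier? Reply only "A" or "B". '
            "Do not be conversational. "
            f"{render('A', joke_a)} {render('B', joke_b)}")

def parse_reply(reply):
    """Return "A" or "B", or None if the model said anything else.

    Assumption: stray whitespace, quotes, and a trailing period are
    tolerated; anything longer counts as a non-answer.
    """
    cleaned = reply.strip().strip('".').upper()
    return cleaned if cleaned in ("A", "B") else None
```

Returning None for chatty replies keeps them separate from wrong answers, so a model that ignores the instruction isn't quietly scored as a guess.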

This is my first crack at evals. I'm open to improvements.

1 comment

Try Kimi K2 (not the new 2.5), it's known for its default voice being decidedly casual and different from most models.