Hacker News
by YetAnotherNick 793 days ago
Yes, I don't know how people don't realize how well cheap tricks work in Chatbot Arena. A single base model can show differences of hundreds of Elo points depending on the way it is tuned. And in most cases, heavy instruction tuning even slightly decreases reasoning ability on standard benchmarks. You can see that the base model scores better on MMLU/ARC most of the time on the Hugging Face leaderboard.

Even GPT-4-1106 seems to only sound better than GPT-4-0613 and work for a wider range of prompts. But with a well-defined prompt and follow-up questions, I don't think there is an improvement in reasoning.

1 comments

When I tried Phi2 it was just bad. I don't know where you got this fantasy that people accept obviously wrong answers because of "pandering".
Obviously a correct answer matters more, but ~100-200 Elo points could be gained just from better writing. The answer itself would be worth a range of about 500 Elo in comparison.
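
For a sense of scale, here is a rough sketch of what those gaps mean as head-to-head win rates. It assumes the standard Elo expected-score formula (base 10, scale 400); the function name `expected_score` is just mine, and I'm assuming the Arena ratings can be read this way.

```python
def expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score for the higher-rated side under standard Elo.

    A tie counts as half a win. rating_gap is (higher rating - lower rating).
    The scale of 400 is the conventional Elo constant (an assumption here).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gaps mentioned in the comment: 100-200 for writing, ~500 for the answer itself.
for gap in (100, 200, 500):
    print(f"{gap:>3} Elo gap -> {expected_score(gap):.1%} expected win rate")
```

Under that assumption, a 100-200 point gap is roughly a 64-76% win rate, while 500 points is about 95%.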
> just for better writing

in my use cases, better writing makes a better answer