| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by YetAnotherNick 793 days ago
	Yes, I don't know how people don't realize how much cheap tricks works in Chatbot Arena. A single base model produces 100s of ELO difference depending on the way it is tuned. And on most cases, instruction tuning heavily slightly even decreases reasoning ability on standard benchmark. You can see base model scores better in MMLU/ARC most of the times in huggingface leaderboard. Even GPT-4-1106 seems to only sounds better than GPT-4-0613 and works for wider range of prompt. But in a well defined prompt and follow up questions I don't think there is an improvement in reasoning.

1 comments

imtringued 793 days ago

When I tried Phi2 it was just bad. I don't know where you got this fantasy from that people accept obviously wrong answers, because of "pandering".

link

YetAnotherNick 793 days ago

Obviously correct answer matters more but ~100-200 elo points could be gained just for better writing. Answer would be range of 500 elo in comparison.

link

rgbrgb 793 days ago

> just for better writing

in my use cases, better writing makes a better answer

link