Evaluating 55 LLMs with GPT-4

	Evaluating 55 LLMs with GPT-4 (benchmarks.llmonitor.com)
	36 points by vincelt 1027 days ago

7 comments

bradknowles 1027 days ago

How is this benchmark not inherently biased towards GPT?

If I did the same sort of thing but used Claude to grade the tests, would I get similar results? Or would that be inherently biased towards Claude scoring high?

crashocaster 1027 days ago

I always find evals of this flavor off-putting, given that GPT-3.5 and GPT-4 likely share preference models (or at least feedback data).

habitue 1027 days ago

Each prompt should be evaluated multiple times to see how much variance there is in the scores. Even GPT-4 grading GPT-4 should probably be done several times.
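
A rough sketch of what repeated grading could look like. The `grade` callable is hypothetical (it stands in for whatever calls the grader model and returns a numeric score), and `noisy_grader` is only a random stand-in so the snippet runs without an API; neither comes from the benchmark itself.

```python
import random
import statistics
from typing import Callable


def score_spread(
    grade: Callable[[str, str], float],
    prompt: str,
    answer: str,
    runs: int = 5,
) -> tuple[float, float]:
    """Grade the same answer `runs` times and return (mean, sample stdev)."""
    if runs < 2:
        raise ValueError("need at least 2 runs to estimate variance")
    scores = [grade(prompt, answer) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)


def noisy_grader(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for an LLM grader: a base score plus random noise."""
    return 7.0 + random.uniform(-1.0, 1.0)


if __name__ == "__main__":
    mean, spread = score_spread(noisy_grader, "some prompt", "some answer", runs=10)
    print(f"mean={mean:.2f} stdev={spread:.2f}")
```

If the stdev for a single answer is comparable to the gap between two models on the leaderboard, the ranking between them probably isn't meaningful.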

natsucks 1026 days ago

Why no multi-turn evaluation? A lot of these benchmarks fail to capture the strength of ghost attention used in Llama 2 chat models.

aiunboxed 1027 days ago

Any reason why PaLM or Cohere models are not here?

jasonjmcghee 1026 days ago

PaLM 2 is tied for #10.

londons_explore 1026 days ago

GPT-4-0314 is top of the league table (i.e., not the latest version, but the version released in March).

Is this our Concorde moment?

ionwake 1027 days ago

Really cool, thanks.
