by karmasimida
793 days ago
Even the random seed can cause a big shift in HumanEval performance; if you know, you know. It is perfectly legal to pick the one ckpt that looks best on those benchmarks and move on. HumanEval is meaningless regardless: those 164 problems have been overfit to a T. Hook this up to the LLM arena and we will get a better picture of how powerful these models really are.
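
For a rough sense of scale on the 164-problem point, here is a back-of-the-envelope sketch. It assumes each problem is an independent pass/fail trial, which is a simplification and not how seed variance actually arises, and the function name is just illustrative. With that caveat, it estimates how much a pass@1 score could move from noise alone.

```python
import math

N_PROBLEMS = 164  # HumanEval size, as mentioned in the comment

def pass_at_1_std_error(p: float, n: int = N_PROBLEMS) -> float:
    """Binomial standard error of a pass@1 score over n independent problems.

    Assumption: every problem is an independent Bernoulli trial with
    success probability p. Real benchmark noise is messier than this.
    """
    return math.sqrt(p * (1.0 - p) / n)

# One problem flipping from fail to pass moves the score by 1/164.
print(f"one problem = {1 / N_PROBLEMS:.2%} of the score")

for p in (0.3, 0.5, 0.7):
    se = pass_at_1_std_error(p)
    # 1.96 * se is the usual normal-approximation 95% half-width.
    print(f"pass@1={p:.0%}: std error ~{se:.1%}, 95% half-width ~{1.96 * se:.1%}")
```

Under that assumption, differences of a few points between checkpoints or seeds are hard to tell apart from noise, which is why picking the best-looking ckpt can flatter the reported number.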