by pshirshov
5 days ago
Yes, I know all the flaws. As I said, it's not an objective way to measure the performance of a model - but it is intended to produce something that only humans can measure. The goal is for you to be able to play the game and judge - and fill in the human checklist for yourself if you wish. You didn't get why the automatic review scores are there - all of the reviewers, including Fable, happily assign the highest scores to code which can't even run. In my opinion that is a sort of empirical evidence that these models are very far from the "AGI" state. Anyway, while I didn't explain the methodology and the purpose of this experiment, I have something material to discuss. The "awesome Fable" claims are not material at all. Can you bring something clearly showcasing Fable's superiority?