Hacker News
by sebzim4500 1047 days ago
It's not easy to compare them, to be fair.

I guess you could come up with a thousand example prompts and pay some students to pick which output is better, but I can also see why you wouldn't bother. It probably depends on language, type of prompt, etc.
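
For what it's worth, the tallying side of that would be small. Here is a rough sketch, assuming each rater vote is recorded as a (category, winner) pair where the winner is "A", "B" or "tie"; the category labels and the function name are made up for illustration.

```python
from collections import defaultdict

def win_rate_by_category(votes):
    """Return model A's win rate per category, ignoring ties.

    votes: iterable of (category, winner) pairs, winner in {"A", "B", "tie"}.
    This data layout is an assumption, not something from a real eval tool.
    """
    tallies = defaultdict(lambda: {"A": 0, "B": 0})
    for category, winner in votes:
        if winner in ("A", "B"):  # ties carry no preference signal
            tallies[category][winner] += 1
    return {
        category: t["A"] / (t["A"] + t["B"])
        for category, t in tallies.items()
    }

# Hypothetical votes from human raters.
votes = [
    ("python", "A"), ("python", "A"), ("python", "B"),
    ("sql", "B"), ("sql", "tie"), ("sql", "B"),
]
rates = win_rate_by_category(votes)
print(rates)
```

Splitting by category is the point: one overall number would hide the case where a model wins on one language and loses on another.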

2 comments

Sure it's easy -- you can use benchmarks like HumanEval, which Stability did. They just didn't compare to Codex or GPT-4. Of course such benchmarks don't capture all aspects of an LLM's capabilities, but they're a lot better than nothing!
One could team up with HackerRank/LeetCode, let the model code in the interface (maybe there's an API for that already, no idea), execute its code verbatim and see how many test cases it gets right the first time around. Then, like for humans, give it a clue about one of the tests not passing (or code not working, too slow, etc.). Give points based on the difficulty of the question and the number of clues needed.
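
A minimal sketch of that scoring rule, just to make it concrete. The point values per difficulty and the halving per clue are numbers I picked for illustration, not anything HackerRank or LeetCode actually uses.

```python
from dataclasses import dataclass

# Hypothetical point values per difficulty tier.
DIFFICULTY_POINTS = {"easy": 1.0, "medium": 2.0, "hard": 4.0}

@dataclass
class Attempt:
    difficulty: str  # "easy", "medium" or "hard"
    solved: bool     # did the model eventually pass every test case?
    clues_used: int  # 0 means it passed on the first submission

def score_attempt(attempt: Attempt, clue_penalty: float = 0.5) -> float:
    """Full points for a first-try pass; each clue multiplies the award by clue_penalty."""
    if not attempt.solved:
        return 0.0
    return DIFFICULTY_POINTS[attempt.difficulty] * clue_penalty ** attempt.clues_used

def total_score(attempts: list[Attempt]) -> float:
    return sum(score_attempt(a) for a in attempts)

# Made-up results for one model.
attempts = [
    Attempt("easy", solved=True, clues_used=0),     # 1.0
    Attempt("hard", solved=True, clues_used=2),     # 4.0 * 0.5**2 = 1.0
    Attempt("medium", solved=False, clues_used=3),  # never solved, 0.0
]
print(total_score(attempts))  # 2.0
```

A multiplicative penalty keeps every solved question worth something no matter how many clues were needed; a flat deduction per clue would be an equally reasonable choice.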

I guess the obvious caveat is that these models are probably overfitted on these types of questions. But a specific benchmark could be made containing questions kept secret from the models. Time to build "Botrank" I guess.